Chapter 1

The Fundamentals of HTK

[Figure: speech data and transcriptions are passed to the training tools; the recogniser then maps unknown speech to a transcription]

HTK is a toolkit for building Hidden Markov Models (HMMs). HMMs can be used to model
any time series and the core of HTK is similarly general-purpose. However, HTK is primarily
designed for building HMM-based speech processing tools, in particular recognisers. Thus, much of
the infrastructure support in HTK is dedicated to this task. As shown in the picture above, there
are two major processing stages involved. Firstly, the HTK training tools are used to estimate
the parameters of a set of HMMs using training utterances and their associated transcriptions.
Secondly, unknown utterances are transcribed using the HTK recognition tools.

The main body of this book is mostly concerned with the mechanics of these two processes.
However, before launching into detail it is necessary to understand some of the basic principles of
HMMs. It is also helpful to have an overview of the toolkit and to have some appreciation of how
training and recognition in HTK are organised.

This first part of the book attempts to provide this information. In this chapter, the basic ideas
of HMMs and their use in speech recognition are introduced. The following chapter then presents a
brief overview of HTK and, for users of older versions, it highlights the main differences in version
2.0 and later. Finally in this tutorial part of the book, chapter 3 describes how a HMM-based
speech recogniser can be built using HTK. It does this by describing the construction of a simple
small vocabulary continuous speech recogniser.

The second part of the book then revisits the topics skimmed over here and discusses each in
detail. This can be read in conjunction with the third and final part of the book which provides
a reference manual for HTK. This includes a description of each tool, summaries of the various
parameters used to configure HTK and a list of the error messages that it generates when things
go wrong.

Finally, note that this book is concerned only with HTK as a tool-kit. It does not provide 
information for using the HTK libraries as a programming environment. 
1.1 General Principles of HMMs

Fig. 1.1 Message Encoding/Decoding (concept: a sequence of symbols)


Speech recognition systems generally assume that the speech signal is a realisation of some message
encoded as a sequence of one or more symbols (see Fig. 1.1). To effect the reverse operation of
recognising the underlying symbol sequence given a spoken utterance, the continuous speech waveform
is first converted to a sequence of equally spaced discrete parameter vectors. This sequence of
parameter vectors is assumed to form an exact representation of the speech waveform on the basis
that for the duration covered by a single vector (typically 10ms or so), the speech waveform can
be regarded as being stationary. Although this is not strictly true, it is a reasonable approximation.
Typical parametric representations in common use are smoothed spectra or linear prediction
coefficients plus various other representations derived from these.

The role of the recogniser is to effect a mapping between sequences of speech vectors and the
wanted underlying symbol sequences. Two problems make this very difficult. Firstly, the mapping
from symbols to speech is not one-to-one since different underlying symbols can give rise to similar
speech sounds. Furthermore, there are large variations in the realised speech waveform due to
speaker variability, mood, environment, etc. Secondly, the boundaries between symbols cannot
be identified explicitly from the speech waveform. Hence, it is not possible to treat the speech
waveform as a sequence of concatenated static patterns.

The second problem of not knowing the word boundary locations can be avoided by restricting
the task to isolated word recognition. As shown in Fig. 1.2, this implies that the speech waveform
corresponds to a single underlying symbol (e.g. word) chosen from a fixed vocabulary. Despite the
fact that this simpler problem is somewhat artificial, it nevertheless has a wide range of practical
applications. Furthermore, it serves as a good basis for introducing the basic ideas of HMM-based
recognition before dealing with the more complex continuous speech case. Hence, isolated word
recognition using HMMs will be dealt with first.

1.2 Isolated Word Recognition

Let each spoken word be represented by a sequence of speech vectors or observations $O$, defined as

$$O = o_1, o_2, \ldots, o_T \tag{1.1}$$

where $o_t$ is the speech vector observed at time $t$. The isolated word recognition problem can then
be regarded as that of computing

$$\arg\max_i \{ P(w_i \mid O) \} \tag{1.2}$$

where $w_i$ is the $i$'th vocabulary word. This probability is not computable directly but using Bayes'
Rule gives

$$P(w_i \mid O) = \frac{P(O \mid w_i)\, P(w_i)}{P(O)} \tag{1.3}$$

Thus, for a given set of prior probabilities $P(w_i)$, the most probable spoken word depends only
on the likelihood $P(O \mid w_i)$. Given the dimensionality of the observation sequence $O$, the direct
estimation of the joint conditional probability $P(o_1, o_2, \ldots \mid w_i)$ from examples of spoken words is
not practicable. However, if a parametric model of word production such as a Markov model is
assumed, then estimation from data is possible since the problem of estimating the class conditional
observation densities $P(O \mid w_i)$ is replaced by the much simpler problem of estimating the Markov
model parameters.
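
As a minimal sketch of the decision rule in equations 1.2 and 1.3, the following Python fragment picks the most probable word from a set of log likelihoods and priors. The function name, the dictionary layout and all of the numbers are illustrative assumptions and are not part of HTK; in practice the log likelihoods would be supplied by the word models.

```python
import math


def most_probable_word(log_likelihoods, priors):
    """Return the word maximising P(O|w_i) P(w_i), plus the per-word scores.

    log_likelihoods: dict mapping word -> log P(O|w_i), assumed already computed
    priors:          dict mapping word -> P(w_i)
    P(O) is identical for every word, so it can be dropped from the argmax.
    """
    scores = {w: log_likelihoods[w] + math.log(priors[w]) for w in log_likelihoods}
    return max(scores, key=scores.get), scores


# Illustrative numbers only; they do not come from any real recogniser.
log_likelihoods = {"one": -120.0, "two": -118.5, "three": -125.0}
priors = {"one": 0.5, "two": 0.1, "three": 0.4}

best, scores = most_probable_word(log_likelihoods, priors)
print(best)
```

With equal priors the decision depends on the likelihood alone, which is the case considered in the rest of this chapter.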



Fig. 1.2 Isolated Word Problem (concept: a single word)



In HMM based speech recognition, it is assumed that the sequence of observed speech vectors
corresponding to each word is generated by a Markov model as shown in Fig. 1.3. A Markov model
is a finite state machine which changes state once every time unit and each time $t$ that a state $j$
is entered, a speech vector $o_t$ is generated from the probability density $b_j(o_t)$. Furthermore, the
transition from state $i$ to state $j$ is also probabilistic and is governed by the discrete probability
$a_{ij}$. Fig. 1.3 shows an example of this process where the six state model moves through the state
sequence $X = 1, 2, 2, 3, 4, 4, 5, 6$ in order to generate the sequence $o_1$ to $o_6$. Notice that in HTK, the
entry and exit states of a HMM are non-emitting. This is to facilitate the construction of composite
models as explained in more detail later.

The joint probability that $O$ is generated by model $M$ moving through the state sequence
$X$ is calculated simply as the product of the transition probabilities and the output probabilities.
So for the state sequence $X$ in Fig. 1.3

$$P(O, X \mid M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3) \ldots \tag{1.4}$$

However, in practice, only the observation sequence $O$ is known and the underlying state sequence
$X$ is hidden. This is why it is called a Hidden Markov Model.
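
The product in equation 1.4 can be sketched in a few lines of Python. The function and the toy model below are assumptions made purely for illustration: transition probabilities are held in a dictionary keyed by state pairs, and every output density is replaced by a constant.

```python
def joint_probability(trans, out_prob, states, observations):
    """P(O, X | M) for one complete state path, as in equation 1.4.

    trans[(i, j)]    : transition probability a_ij
    out_prob(j, o)   : output probability b_j(o)
    states           : full path including the non-emitting entry and exit states
    observations     : one observation per emitting state visited
    """
    emitting = states[1:-1]
    assert len(emitting) == len(observations)
    p = 1.0
    prev = states[0]
    for j, o in zip(emitting, observations):
        p *= trans[(prev, j)] * out_prob(j, o)
        prev = j
    return p * trans[(prev, states[-1])]   # final transition into the exit state


# Toy left-to-right six-state model; the values are made up.
trans = {(1, 2): 1.0}
for j in range(2, 6):
    trans[(j, j)] = 0.5        # self loop
    trans[(j, j + 1)] = 0.5    # move to the next state

X = [1, 2, 2, 3, 4, 4, 5, 6]                   # the state sequence of Fig. 1.3
O = ["o1", "o2", "o3", "o4", "o5", "o6"]       # placeholder observations
p = joint_probability(trans, lambda j, o: 0.1, X, O)
```

Here the path uses one entry transition of probability 1.0, six further transitions of probability 0.5 and six output probabilities of 0.1.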



Fig. 1.3 The Markov Generation Model (Markov model $M$ generating the observation sequence $o_1$ to $o_6$)



Given that $X$ is unknown, the required likelihood is computed by summing over all possible
state sequences $X = x(1), x(2), x(3), \ldots, x(T)$, that is

$$P(O \mid M) = \sum_X a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \tag{1.5}$$

where $x(0)$ is constrained to be the model entry state and $x(T+1)$ is constrained to be the model
exit state.

As an alternative to equation 1.5, the likelihood can be approximated by only considering the
most likely state sequence, that is

$$\hat{P}(O \mid M) = \max_X \left\{ a_{x(0)x(1)} \prod_{t=1}^{T} b_{x(t)}(o_t)\, a_{x(t)x(t+1)} \right\} \tag{1.6}$$

Although the direct computation of equations 1.5 and 1.6 is not tractable, simple recursive
procedures exist which allow both quantities to be calculated very efficiently. Before going any
further, however, notice that if equation 1.2 is computable then the recognition problem is solved.
Given a set of models $M_i$ corresponding to words $w_i$, equation 1.2 is solved by using 1.3 and
assuming that

$$P(O \mid w_i) = P(O \mid M_i). \tag{1.7}$$

All this, of course, assumes that the parameters $\{a_{ij}\}$ and $\{b_j(o_t)\}$ are known for each model
$M_i$. Herein lies the elegance and power of the HMM framework. Given a set of training examples
corresponding to a particular model, the parameters of that model can be determined automatically
by a robust and efficient re-estimation procedure. Thus, provided that a sufficient number of
representative examples of each word can be collected then a HMM can be constructed which
implicitly models all of the many sources of variability inherent in real speech. Fig. 1.4 summarises
the use of HMMs for isolated word recognition. Firstly, a HMM is trained for each vocabulary word
using a number of examples of that word. In this case, the vocabulary consists of just three words:
"one", "two" and "three". Secondly, to recognise some unknown word, the likelihood of each model
generating that word is calculated and the most likely model identifies the word.

Fig. 1.4 Using HMMs for Isolated Word Recognition. (a) Training: models $M_1$, $M_2$ and $M_3$ are
estimated from training examples of "one", "two" and "three". (b) Recognition: $P(O \mid M_1)$,
$P(O \mid M_2)$ and $P(O \mid M_3)$ are computed for the unknown $O$ and the maximum is chosen.
1.3 Output Probability Specification

Before the problem of parameter estimation can be discussed in more detail, the form of the output
distributions $\{b_j(o_t)\}$ needs to be made explicit. HTK is designed primarily for modelling
continuous parameters using continuous density multivariate output distributions. It can also handle
observation sequences consisting of discrete symbols, in which case the output distributions are
discrete probabilities. For simplicity, however, the presentation in this chapter will assume that
continuous density distributions are being used. The minor differences that the use of discrete
probabilities entails are noted in chapter 7 and discussed in more detail in chapter 11.

In common with most other continuous density HMM systems, HTK represents output distributions
by Gaussian Mixture Densities. In HTK, however, a further generalisation is made. HTK
allows each observation vector at time $t$ to be split into a number of $S$ independent data streams
$o_{st}$. The formula for computing $b_j(o_t)$ is then

$$b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm}\, \mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s} \tag{1.8}$$

where $M_s$ is the number of mixture components in stream $s$, $c_{jsm}$ is the weight of the $m$'th component
and $\mathcal{N}(\cdot; \mu, \Sigma)$ is a multivariate Gaussian with mean vector $\mu$ and covariance matrix $\Sigma$, that
is

$$\mathcal{N}(o; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left( -\tfrac{1}{2} (o - \mu)' \Sigma^{-1} (o - \mu) \right) \tag{1.9}$$

where $n$ is the dimensionality of $o$.
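
Equations 1.8 and 1.9 can be illustrated with a short Python sketch. It assumes diagonal covariance matrices, which simplifies equation 1.9 to a sum over vector components; the data layout (a list of streams, each holding a weight `gamma` and a list of mixture components) is a hypothetical choice for this example and does not reflect how HTK stores its models.

```python
import math


def log_gaussian_diag(o, mean, var):
    """log N(o; mean, diag(var)): equation 1.9 for a diagonal covariance matrix."""
    n = len(o)
    log_det = sum(math.log(v) for v in var)
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)


def log_output_prob(streams, state):
    """log b_j(o_t) following equation 1.8.

    streams: list of S observation sub-vectors o_st
    state:   list of S dicts {"gamma": stream weight,
                              "mixes": [(c_jsm, mean, var), ...]}
    """
    total = 0.0
    for o_s, stream in zip(streams, state):
        mix_sum = sum(c * math.exp(log_gaussian_diag(o_s, mean, var))
                      for c, mean, var in stream["mixes"])
        total += stream["gamma"] * math.log(mix_sum)   # exponent gamma_s in log form
    return total


# One stream, one standard-normal component, evaluated at its mean.
single = [{"gamma": 1.0, "mixes": [(1.0, [0.0], [1.0])]}]
# The same density written as two identical half-weight components.
double = [{"gamma": 1.0, "mixes": [(0.5, [0.0], [1.0]), (0.5, [0.0], [1.0])]}]

lp_single = log_output_prob([[0.0]], single)
lp_double = log_output_prob([[0.0]], double)
```

Working in the log domain turns the product over streams into a sum and the stream weight into a simple multiplier. The sketch exponentiates each component before summing, so a more careful version would be needed for vectors far from every mean.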

The exponent $\gamma_s$ is a stream weight¹. It can be used to give a particular stream more emphasis.
However, it can only be set manually. No current HTK training tools can estimate values for it.

Multiple data streams are used to enable separate modelling of multiple information sources. In
HTK, the processing of streams is completely general. However, the speech input modules assume
that the source data is split into at most 4 streams. Chapter 5 discusses this in more detail but for
now it is sufficient to remark that the default streams are the basic parameter vector, first (delta)
and second (acceleration) difference coefficients and log energy.

1.4 Baum-Welch Re-Estimation

To determine the parameters of a HMM it is first necessary to make a rough guess at what they
might be. Once this is done, more accurate (in the maximum likelihood sense) parameters can be
found by applying the so-called Baum-Welch re-estimation formulae.

Fig. 1.5 Representing a Mixture (an M-component Gaussian mixture drawn as single Gaussians)



Chapter 8 gives the formulae used in HTK in full detail. Here the basis of the formulae will
be presented in a very informal way. Firstly, it should be noted that the inclusion of multiple
data streams does not alter matters significantly since each stream is considered to be statistically
independent. Furthermore, mixture components can be considered to be a special form of sub-state
in which the transition probabilities are the mixture weights (see Fig. 1.5).

¹ Often referred to as a codebook exponent.

Thus, the essential problem is to estimate the means and variances of a HMM in which each
state output distribution is a single component Gaussian, that is

$$b_j(o_t) = \frac{1}{\sqrt{(2\pi)^n |\Sigma_j|}} \exp\left( -\tfrac{1}{2} (o_t - \mu_j)' \Sigma_j^{-1} (o_t - \mu_j) \right) \tag{1.10}$$

If there was just one state $j$ in the HMM, this parameter estimation would be easy. The maximum
likelihood estimates of $\mu_j$ and $\Sigma_j$ would be just the simple averages, that is

$$\hat{\mu}_j = \frac{1}{T} \sum_{t=1}^{T} o_t \tag{1.11}$$

and

$$\hat{\Sigma}_j = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu_j)(o_t - \mu_j)' \tag{1.12}$$

In practice, of course, there are multiple states and there is no direct assignment of observation
vectors to individual states because the underlying state sequence is unknown. Note, however, that
if some approximate assignment of vectors to states could be made then equations 1.11 and 1.12
could be used to give the required initial values for the parameters. Indeed, this is exactly what
is done in the HTK tool called HInit. HInit first divides the training observation vectors equally
amongst the model states and then uses equations 1.11 and 1.12 to give initial values for the mean
and variance of each state. It then finds the maximum likelihood state sequence using the Viterbi
algorithm described below, reassigns the observation vectors to states and then uses equations 1.11
and 1.12 again to get better initial values. This process is repeated until the estimates do not
change.

Since the full likelihood of each observation sequence is based on the summation of all possible
state sequences, each observation vector $o_t$ contributes to the computation of the maximum
likelihood parameter values for each state $j$. In other words, instead of assigning each observation
vector to a specific state as in the above approximation, each observation is assigned to every state
in proportion to the probability of the model being in that state when the vector was observed.
Thus, if $L_j(t)$ denotes the probability of being in state $j$ at time $t$ then the equations 1.11 and 1.12
given above become the following weighted averages

$$\hat{\mu}_j = \frac{\sum_{t=1}^{T} L_j(t)\, o_t}{\sum_{t=1}^{T} L_j(t)} \tag{1.13}$$

and

$$\hat{\Sigma}_j = \frac{\sum_{t=1}^{T} L_j(t)\, (o_t - \mu_j)(o_t - \mu_j)'}{\sum_{t=1}^{T} L_j(t)} \tag{1.14}$$

where the summations in the denominators are included to give the required normalisation.
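
The weighted averages can be sketched for scalar observations as follows. This is an illustration only: the function name is hypothetical, the variance is computed about the newly estimated mean (one reasonable reading of the formula), and the occupation probabilities are simply supplied as a list rather than computed from a model.

```python
def reestimate_gaussian(observations, occupancy):
    """Occupancy-weighted mean and variance for one state, scalar observations.

    observations: list of o_t
    occupancy:    list of L_j(t), the probability of being in state j at time t
    """
    denom = sum(occupancy)                        # shared normalising denominator
    mean = sum(l * o for l, o in zip(occupancy, observations)) / denom
    var = sum(l * (o - mean) ** 2 for l, o in zip(occupancy, observations)) / denom
    return mean, var


obs = [1.0, 2.0, 3.0, 4.0]

# Hard assignment: only the first two frames belong to the state.
hard_mean, hard_var = reestimate_gaussian(obs, [1.0, 1.0, 0.0, 0.0])

# Every frame fully assigned: reduces to the simple averages.
flat_mean, flat_var = reestimate_gaussian(obs, [1.0, 1.0, 1.0, 1.0])
```

Setting every $L_j(t)$ to 0 or 1 recovers the hard assignment used for initialisation, which is why the same accumulation code can serve both cases.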

Equations 1.13 and 1.14 are the Baum-Welch re-estimation formulae for the means and covariances
of a HMM. A similar but slightly more complex formula can be derived for the transition
probabilities (see chapter 8).

Of course, to apply equations 1.13 and 1.14, the probability of state occupation $L_j(t)$ must
be calculated. This is done efficiently using the so-called Forward-Backward algorithm. Let the
forward probability² $\alpha_j(t)$ for some model $M$ with $N$ states be defined as

$$\alpha_j(t) = P(o_1, \ldots, o_t, x(t) = j \mid M). \tag{1.15}$$

That is, $\alpha_j(t)$ is the joint probability of observing the first $t$ speech vectors and being in state $j$ at
time $t$. This forward probability can be efficiently calculated by the following recursion

$$\alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t). \tag{1.16}$$

² Since the output distributions are densities, these are not really probabilities but it is a convenient fiction.
This recursion depends on the fact that the probability of being in state $j$ at time $t$ and seeing
observation $o_t$ can be deduced by summing the forward probabilities for all possible predecessor
states $i$ weighted by the transition probability $a_{ij}$. The slightly odd limits are caused by the fact
that states 1 and $N$ are non-emitting³. The initial conditions for the above recursion are

$$\alpha_1(1) = 1 \tag{1.17}$$

$$\alpha_j(1) = a_{1j}\, b_j(o_1) \tag{1.18}$$

for $1 < j < N$ and the final condition is given by

$$\alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN}. \tag{1.19}$$

Notice here that from the definition of $\alpha_j(t)$,

$$P(O \mid M) = \alpha_N(T). \tag{1.20}$$

Hence, the calculation of the forward probability also yields the total likelihood $P(O \mid M)$.
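
The forward recursion can be sketched in Python and checked against a literal evaluation of the sum over all state sequences. Everything in this example is an assumption made for illustration: the four-state toy model, the convention that row and column 0 of the transition matrix are unused so that indices match the text, and the output probabilities, which are looked up in a table rather than computed from densities.

```python
import itertools


def forward(a, b, T):
    """Forward recursion for states 1..N where states 1 and N are non-emitting.

    a[i][j] holds a_ij (index 0 unused); b(j, t) returns b_j(o_t) for t = 1..T.
    Returns (alpha, total likelihood P(O|M)).
    """
    N = len(a) - 1
    emitting = range(2, N)
    alpha = [[0.0] * (T + 1) for _ in range(N + 1)]
    for j in emitting:                                   # initial condition
        alpha[j][1] = a[1][j] * b(j, 1)
    for t in range(2, T + 1):                            # recursion over time
        for j in emitting:
            alpha[j][t] = sum(alpha[i][t - 1] * a[i][j] for i in emitting) * b(j, t)
    total = sum(alpha[i][T] * a[i][N] for i in emitting)  # final condition
    return alpha, total


def brute_force(a, b, T):
    """Sum the path probability over every possible emitting-state sequence."""
    N = len(a) - 1
    total = 0.0
    for X in itertools.product(range(2, N), repeat=T):
        p = a[1][X[0]]
        for t in range(1, T + 1):
            nxt = X[t] if t < T else N                   # x(T+1) is the exit state
            p *= b(X[t - 1], t) * a[X[t - 1]][nxt]
        total += p
    return total


# Toy four-state model with two emitting states; all values are made up.
a = [[0.0] * 5 for _ in range(5)]
a[1][2] = 1.0
a[2][2], a[2][3] = 0.6, 0.4
a[3][3], a[3][4] = 0.7, 0.3
out = {2: [0.5, 0.4, 0.1], 3: [0.1, 0.3, 0.6]}           # out[j][t-1] = b_j(o_t)


def b(j, t):
    return out[j][t - 1]


alpha, likelihood = forward(a, b, 3)
```

The brute-force version grows exponentially with $T$, whereas the recursion needs on the order of $N^2 T$ multiplications, which is the efficiency gain referred to above.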
The backward probability $\beta_j(t)$ is defined as

$$\beta_j(t) = P(o_{t+1}, \ldots, o_T \mid x(t) = j, M). \tag{1.21}$$

As in the forward case, this backward probability can be computed efficiently using the following
recursion

$$\beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1) \tag{1.22}$$

with initial condition given by

$$\beta_i(T) = a_{iN} \tag{1.23}$$

for $1 < i < N$ and final condition given by

$$\beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1). \tag{1.24}$$


Notice that in the definitions above, the forward probability is a joint probability whereas the
backward probability is a conditional probability. This somewhat asymmetric definition is deliberate
since it allows the probability of state occupation to be determined by taking the product of the
two probabilities. From the definitions,

$$\alpha_j(t)\, \beta_j(t) = P(O, x(t) = j \mid M). \tag{1.25}$$

Hence,

$$L_j(t) = P(x(t) = j \mid O, M) = \frac{P(O, x(t) = j \mid M)}{P(O \mid M)} = \frac{1}{P}\, \alpha_j(t)\, \beta_j(t) \tag{1.26}$$

where $P = P(O \mid M)$.
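
To make the role of the two recursions concrete, the sketch below computes the state occupation probabilities for a small model. It is a plain-probability illustration under assumed conventions (1-based state indices with index 0 unused, a toy four-state model with made-up numbers, tabulated output probabilities) and is not how HTK organises the computation.

```python
def state_occupation(a, b, T):
    """Return (L, P) where L[j][t-1] = L_j(t) and P = P(O|M).

    a[i][j] holds a_ij (index 0 unused); states 1 and N are non-emitting;
    b(j, t) returns b_j(o_t) for t = 1..T.
    """
    N = len(a) - 1
    emitting = range(2, N)
    alpha = [[0.0] * (T + 1) for _ in range(N + 1)]
    beta = [[0.0] * (T + 1) for _ in range(N + 1)]
    for j in emitting:
        alpha[j][1] = a[1][j] * b(j, 1)          # forward initial condition
        beta[j][T] = a[j][N]                     # backward initial condition
    for t in range(2, T + 1):                    # forward pass
        for j in emitting:
            alpha[j][t] = sum(alpha[i][t - 1] * a[i][j] for i in emitting) * b(j, t)
    for t in range(T - 1, 0, -1):                # backward pass
        for i in emitting:
            beta[i][t] = sum(a[i][j] * b(j, t + 1) * beta[j][t + 1] for j in emitting)
    P = sum(alpha[i][T] * a[i][N] for i in emitting)
    L = {j: [alpha[j][t] * beta[j][t] / P for t in range(1, T + 1)] for j in emitting}
    return L, P


# Toy four-state model with two emitting states; all values are made up.
a = [[0.0] * 5 for _ in range(5)]
a[1][2] = 1.0
a[2][2], a[2][3] = 0.6, 0.4
a[3][3], a[3][4] = 0.7, 0.3
out = {2: [0.5, 0.4, 0.1], 3: [0.1, 0.3, 0.6]}   # out[j][t-1] = b_j(o_t)


def b(j, t):
    return out[j][t - 1]


L, P = state_occupation(a, b, 3)
```

Because the model must be in exactly one emitting state at each frame, the occupation probabilities at any time $t$ sum to one, which is a convenient sanity check on an implementation.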

All of the information needed to perform HMM parameter re-estimation using the Baum-Welch
algorithm is now in place. The steps in this algorithm may be summarised as follows:

1. For every parameter vector/matrix requiring re-estimation, allocate storage for the numerator
and denominator summations of the form illustrated by equations 1.13 and 1.14. These storage
locations are referred to as accumulators⁴.

2. Calculate the forward and backward probabilities for all states $j$ and times $t$.

3. For each state $j$ and time $t$, use the probability $L_j(t)$ and the current observation vector $o_t$
to update the accumulators for that state.

4. Use the final accumulator values to calculate new parameter values.

5. If the value of $P = P(O \mid M)$ for this iteration is not higher than the value at the previous
iteration then stop, otherwise repeat the above steps using the new re-estimated parameter
values.

³ To understand equations involving a non-emitting state at time $t$, the time should be thought of as being $t - \delta t$
if it is an entry state, and $t + \delta t$ if it is an exit state. This becomes important when HMMs are connected together
in sequence so that transitions across non-emitting states take place between frames.

⁴ Note that normally the summations in the denominators of the re-estimation formulae are identical across the
parameter sets of a given state and therefore only a single common storage location for the denominators is required
and it need only be calculated once. However, HTK supports a generalised parameter tying mechanism which can
result in the denominator summations being different. Hence, in HTK the denominator summations are always
stored and calculated individually for each distinct parameter vector or matrix.

All of the above assumes that the parameters for a HMM are re-estimated from a single observation
sequence, that is a single example of the spoken word. In practice, many examples are
needed to get good parameter estimates. However, the use of multiple observation sequences adds
no additional complexity to the algorithm. Steps 2 and 3 above are simply repeated for each distinct
training sequence.

One final point that should be mentioned is that the computation of the forward and backward
probabilities involves taking the product of a large number of probabilities. In practice, this means
that the actual numbers involved become very small. Hence, to avoid numerical problems, the
forward-backward computation is computed in HTK using log arithmetic.

The HTK program which implements the above algorithm is called HRest. In combination
with the tool HInit for estimating initial values mentioned earlier, HRest allows isolated word
HMMs to be constructed from a set of training examples using Baum-Welch re-estimation.
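
The need for log arithmetic can be seen from a small Python sketch. Products of probabilities become sums of logs, but the forward and backward recursions also require sums of probabilities, so a helper that adds two quantities held as logs is needed. The function below is a generic illustration of that idea with assumed names; it is not a description of the routines inside HTK.

```python
import math

LOG_ZERO = float("-inf")   # representation of log(0) used in this sketch


def log_add(x, y):
    """Return log(exp(x) + exp(y)) without evaluating exp(x) or exp(y) directly."""
    if x < y:
        x, y = y, x                      # make x the larger of the two
    if y == LOG_ZERO:
        return x                         # adding zero probability changes nothing
    return x + math.log1p(math.exp(y - x))


# Two probabilities far too small to represent directly as floats.
tiny = -1000.0                           # log of a probability around 1e-434
combined = log_add(tiny, tiny)           # log of twice that probability
```

Evaluating `math.exp(-1000.0)` directly returns `0.0`, so the same sum carried out on raw probabilities would be lost to underflow.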


1.5 Recognition and Viterbi Decoding

The previous section has described the basic ideas underlying HMM parameter re-estimation using
the Baum-Welch algorithm. In passing, it was noted that the efficient recursive algorithm for
computing the forward probability also yielded as a by-product the total likelihood $P(O \mid M)$. Thus,
this algorithm could also be used to find the model which yields the maximum value of $P(O \mid M_i)$,
and hence, it could be used for recognition.

In practice, however, it is preferable to base recognition on the maximum likelihood state sequence
since this generalises easily to the continuous speech case whereas the use of the total
probability does not. This likelihood is computed using essentially the same algorithm as the forward
probability calculation except that the summation is replaced by a maximum operation. For
a given model $M$, let $\phi_j(t)$ represent the maximum likelihood of observing speech vectors $o_1$ to
$o_t$ and being in state $j$ at time $t$. This partial likelihood can be computed efficiently using the
following recursion (cf. equation 1.16)

$$\phi_j(t) = \max_i \{ \phi_i(t-1)\, a_{ij} \}\, b_j(o_t) \tag{1.27}$$

where

$$\phi_1(1) = 1 \tag{1.28}$$

$$\phi_j(1) = a_{1j}\, b_j(o_1) \tag{1.29}$$

for $1 < j < N$. The maximum likelihood $\hat{P}(O \mid M)$ is then given by

$$\phi_N(T) = \max_i \{ \phi_i(T)\, a_{iN} \} \tag{1.30}$$


As for the re-estimation case, the direct computation of likelihoods leads to underflow, hence,
log likelihoods are used instead. The recursion of equation 1.27 then becomes

$$\psi_j(t) = \max_i \{ \psi_i(t-1) + \log(a_{ij}) \} + \log(b_j(o_t)). \tag{1.31}$$

This recursion forms the basis of the so-called Viterbi algorithm. As shown in Fig. 1.6, this algorithm
can be visualised as finding the best path through a matrix where the vertical dimension represents
the states of the HMM and the horizontal dimension represents the frames of speech (i.e. time).
Each large dot in the picture represents the log probability of observing that frame at that time and
each arc between dots corresponds to a log transition probability. The log probability of any path
is computed simply by summing the log transition probabilities and the log output probabilities
along that path. The paths are grown from left-to-right column-by-column. At time $t$, each partial
path $\psi_i(t-1)$ is known for all states $i$, hence equation 1.31 can be used to compute $\psi_j(t)$ thereby
extending the partial paths by one time frame.
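
The log-domain recursion, together with the back-pointers needed to recover the best path, can be sketched as follows. The toy model, the 1-based indexing with index 0 unused, and the tabulated output probabilities are all assumptions for illustration; HTK itself does not expose the algorithm in this form.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to minus infinity instead of raising an error."""
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi(a, b, T):
    """Return (best log likelihood, best emitting-state path) for one HMM.

    a[i][j] holds a_ij (index 0 unused); states 1 and N are non-emitting;
    b(j, t) returns b_j(o_t) for t = 1..T.
    """
    N = len(a) - 1
    emitting = range(2, N)
    psi = [[NEG_INF] * (T + 1) for _ in range(N + 1)]
    back = [[0] * (T + 1) for _ in range(N + 1)]
    for j in emitting:                                   # leave the entry state
        psi[j][1] = safe_log(a[1][j]) + safe_log(b(j, 1))
    for t in range(2, T + 1):                            # extend paths by one frame
        for j in emitting:
            best_i = max(emitting, key=lambda i: psi[i][t - 1] + safe_log(a[i][j]))
            psi[j][t] = psi[best_i][t - 1] + safe_log(a[best_i][j]) + safe_log(b(j, t))
            back[j][t] = best_i
    last = max(emitting, key=lambda i: psi[i][T] + safe_log(a[i][N]))
    best = psi[last][T] + safe_log(a[last][N])           # move into the exit state
    path = [last]
    for t in range(T, 1, -1):                            # follow the back-pointers
        path.append(back[path[-1]][t])
    path.reverse()
    return best, path


# Toy four-state model with two emitting states; all values are made up.
a = [[0.0] * 5 for _ in range(5)]
a[1][2] = 1.0
a[2][2], a[2][3] = 0.6, 0.4
a[3][3], a[3][4] = 0.7, 0.3
out = {2: [0.5, 0.4, 0.1], 3: [0.1, 0.3, 0.6]}           # out[j][t-1] = b_j(o_t)


def b(j, t):
    return out[j][t - 1]


best_log_prob, best_path = viterbi(a, b, 3)
```

For this model only two paths have non-zero probability, with probabilities 0.00864 and 0.00756, so the returned path corresponds to the first of these.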
Fig. 1.6 The Viterbi Algorithm for Isolated Word Recognition


This concept of a path is extremely important and it is generalised below to deal with the
continuous speech case.

This completes the discussion of isolated word recognition using HMMs. There is no HTK tool
which implements the above Viterbi algorithm directly. Instead, a tool called HVite is provided
which along with its supporting libraries, HNet and HRec, is designed to handle continuous
speech. Since this recogniser is syntax directed, it can also perform isolated word recognition as a
special case. This is discussed in more detail below.


1.6 Continuous Speech Recognition 

Returning now to the conceptual model of speech production and recognition exemplified by Fig. 1.1,
it should be clear that the extension to continuous speech simply involves connecting HMMs together
in sequence. Each model in the sequence corresponds directly to the assumed underlying symbol.
These could be either whole words for so-called connected speech recognition or sub-words such as
phonemes for continuous speech recognition. The reason for including the non-emitting entry and
exit states should now be evident: these states provide the glue needed to join models together.

There are, however, some practical difficulties to overcome. The training data for continuous
speech must consist of continuous utterances and, in general, the boundaries dividing the segments
of speech corresponding to each underlying sub-word model in the sequence will not be known. In
practice, it is usually feasible to mark the boundaries of a small amount of data by hand. All of
the segments corresponding to a given model can then be extracted and the isolated word style
of training described above can be used. However, the amount of data obtainable in this way is
usually very limited and the resultant models will be poor estimates. Furthermore, even if there
was a large amount of data, the boundaries imposed by hand-marking may not be optimal as far
as the HMMs are concerned. Hence, in HTK the use of HInit and HRest for initialising sub-word
models is regarded as a bootstrap operation⁵. The main training phase involves the use of a tool
called HERest which does embedded training.

Embedded training uses the same Baum-Welch procedure as for the isolated case but rather
than training each model individually all models are trained in parallel. It works in the following
steps:

1. Allocate and zero accumulators for all parameters of all HMMs.

2. Get the next training utterance.

3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol
transcription of the training utterance.

4. Calculate the forward and backward probabilities for the composite HMM. The inclusion
of intermediate non-emitting states in the composite model requires some changes to the
computation of the forward and backward probabilities but these are only minor. The details
are given in chapter 8.

5. Use the forward and backward probabilities to compute the probabilities of state occupation
at each time frame and update the accumulators in the usual way.

6. Repeat from 2 until all training utterances have been processed.

7. Use the accumulators to calculate new parameter estimates for all of the HMMs.

⁵ They can even be avoided altogether by using a flat start as described in section 8.3.


These steps can then all be repeated as many times as is necessary to achieve the required convergence.
Notice that although the location of symbol boundaries in the training data is not required
(or wanted) for this procedure, the symbolic transcription of each training utterance is needed.

Whereas the extensions needed to the Baum-Welch procedure for training sub-word models are
relatively minor⁶, the corresponding extensions to the Viterbi algorithm are more substantial.

In HTK, an alternative formulation of the Viterbi algorithm is used called the Token Passing
Model⁷. In brief, the token passing model makes the concept of a state alignment path explicit.
Imagine each state $j$ of a HMM at time $t$ holds a single moveable token which contains, amongst
other information, the partial log probability $\psi_j(t)$. This token then represents a partial match
between the observation sequence $o_1$ to $o_t$ and the model subject to the constraint that the model
is in state $j$ at time $t$. The path extension algorithm represented by the recursion of equation 1.31
is then replaced by the equivalent token passing algorithm which is executed at each time frame $t$.
The key steps in this algorithm are as follows:

1. Pass a copy of every token in state $i$ to all connecting states $j$, incrementing the log probability
of the copy by $\log(a_{ij}) + \log(b_j(o_t))$.

2. Examine the tokens in every state and discard all but the token with the highest probability.

In practice, some modifications are needed to deal with the non-emitting states but these are
straightforward if the tokens in entry states are assumed to represent paths extended to time $t - \delta t$
and tokens in exit states are assumed to represent paths extended to time $t + \delta t$.
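
The two token passing steps can be sketched for a single HMM as follows. The representation of a token as a (log probability, path) pair, the toy model and the 1-based indexing are assumptions made for this illustration; the handling of the non-emitting states is reduced to starting with one token in the entry state and collecting tokens into the exit state after the last frame.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    """log(p), with log(0) mapped to minus infinity."""
    return math.log(p) if p > 0.0 else NEG_INF


def token_passing(a, b, T):
    """Return the best (log probability, emitting-state path) token at the exit state.

    a[i][j] holds a_ij (index 0 unused); states 1 and N are non-emitting;
    b(j, t) returns b_j(o_t) for t = 1..T.
    """
    N = len(a) - 1
    tokens = {1: (0.0, [])}                      # a single token in the entry state
    for t in range(1, T + 1):
        new_tokens = {}
        for i, (log_p, path) in tokens.items():
            for j in range(2, N):
                if a[i][j] <= 0.0:
                    continue                     # states i and j are not connected
                # Step 1: pass a copy of the token from state i to state j.
                copy = (log_p + math.log(a[i][j]) + safe_log(b(j, t)), path + [j])
                # Step 2: keep only the best token arriving in each state.
                if j not in new_tokens or copy[0] > new_tokens[j][0]:
                    new_tokens[j] = copy
        tokens = new_tokens
    finals = [(log_p + math.log(a[i][N]), path)
              for i, (log_p, path) in tokens.items() if a[i][N] > 0.0]
    return max(finals, key=lambda tok: tok[0])


# Toy four-state model with two emitting states; all values are made up.
a = [[0.0] * 5 for _ in range(5)]
a[1][2] = 1.0
a[2][2], a[2][3] = 0.6, 0.4
a[3][3], a[3][4] = 0.7, 0.3
out = {2: [0.5, 0.4, 0.1], 3: [0.1, 0.3, 0.6]}   # out[j][t-1] = b_j(o_t)


def b(j, t):
    return out[j][t - 1]


log_prob, path = token_passing(a, b, 3)
```

Carrying the whole path inside each token is wasteful and is done here only to make the equivalence with the best-path recursion easy to see; a linked record of decisions is the more economical choice.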

The point of using the Token Passing Model is that it extends very simply to the continuous
speech case. Suppose that the allowed sequence of HMMs is defined by a finite state network. For
example, Fig. 1.7 shows a simple network in which each word is defined as a sequence of phoneme-based
HMMs and all of the words are placed in a loop. In this network, the oval boxes denote HMM
instances and the square boxes denote word-end nodes. This composite network is essentially just
a single large HMM and the above Token Passing algorithm applies. The only difference now is
that more information is needed beyond the log probability of the best token. When the best token
reaches the end of the speech, the route it took through the network must be known in order to
recover the recognised sequence of models.

⁶ In practice, a good deal of extra work is needed to achieve efficient operation on large training databases. For
example, the HERest tool includes facilities for pruning on both the forward and backward passes and parallel
operation on a network of machines.

⁷ See "Token Passing: a Conceptual Model for Connected Speech Recognition Systems", SJ Young, NH Russell and
JHS Thornton, CUED Technical Report F_INFENG/TR38, Cambridge University, 1989. Available by anonymous
ftp from svr-ftp.eng.cam.ac.uk.

Fig. 1.7 Recognition Network for Continuously Spoken Word Recognition

The history of a token's route through the network may be recorded efficiently as follows. Every
token carries a pointer called a word end link. When a token is propagated from the exit state of a
word (indicated by passing through a word-end node) to the entry state of another, that transition
represents a potential word boundary. Hence, a record called a Word Link Record is generated in
which is stored the identity of the word from which the token has just emerged and the current
value of the token's link. The token's actual link is then replaced by a pointer to the newly created
WLR. Fig. 1.8 illustrates this process.

Once all of the unknown speech has been processed, the WLRs attached to the link of the best
matching token (i.e. the token with the highest log probability) can be traced back to give the best
matching sequence of words. At the same time the positions of the word boundaries can also be
extracted if required.
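
The bookkeeping described above amounts to a backward-linked list. The following Python sketch shows one way it could be organised; the class names, fields and the `frame` attribute used to remember the boundary position are illustrative assumptions and do not mirror the data structures inside HTK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordLinkRecord:
    word: str                            # word from which the token has just emerged
    frame: int                           # time of the word boundary
    prev: Optional["WordLinkRecord"]     # the token's link before this boundary


@dataclass
class Token:
    log_prob: float
    link: Optional[WordLinkRecord] = None


def cross_word_boundary(token, word, frame):
    """Create a WLR storing the word and the old link, then repoint the token at it."""
    token.link = WordLinkRecord(word, frame, token.link)


def trace_back(token):
    """Follow the chain of WLRs from the token back to the start of the utterance."""
    words = []
    wlr = token.link
    while wlr is not None:
        words.append((wlr.word, wlr.frame))
        wlr = wlr.prev
    words.reverse()                      # records are linked newest-first
    return words


# A token that leaves the word "one" at frame 12 and "two" at frame 30.
token = Token(log_prob=0.0)
cross_word_boundary(token, "one", 12)
cross_word_boundary(token, "two", 30)
history = trace_back(token)
```

Because copies of a token share the records created before they diverged, the cost of recording history grows with the number of word boundaries crossed rather than with the number of frames.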
Fig. 1.8 Recording Word Boundary Decisions

The token passing algorithm for continuous speech has been described in terms of recording the
word sequence only. If required, the same principle can be used to record decisions at the model
and state level. Also, more than just the best token at each word boundary can be saved. This
gives the potential for generating a lattice of hypotheses rather than just the single best hypothesis.
Algorithms based on this idea are called lattice N-best. They are suboptimal because the use of a
single token per state limits the number of different token histories that can be maintained. This
limitation can be avoided by allowing each model state to hold multiple tokens and regarding tokens
as distinct if they come from different preceding words. This gives a class of algorithm called word
N-best which has been shown empirically to be comparable in performance to an optimal N-best
algorithm.

The above outlines the main idea of Token Passing as it is implemented within HTK. The 
algorithms are embedded in the library modules HNet and HRec and they may be invoked using 
the recogniser tool called HVite. They provide single and multiple-token passing recognition, 
single-best output, lattice output, N-best lists, support for cross-word context-dependency, lattice 
rescoring and forced alignment. 



1.7 Speaker Adaptation

Although the training and recognition techniques described previously can produce high performance
recognition systems, these systems can be improved upon by customising the HMMs to the
characteristics of a particular speaker. HTK provides the tools HERest and HVite to perform
adaptation using a small amount of enrollment or adaptation data. The two tools differ in that
HERest performs offline supervised adaptation while HVite recognises the adaptation data and
uses the generated transcriptions to perform the adaptation. Generally, more robust adaptation is
performed in a supervised mode provided by HERest, but given an initial well trained model
set, HVite can still achieve noticeable improvements in performance. Full details of adaptation
and how it is used in HTK can be found in Chapter 9.

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Chapter 2 

An Overview of the HTK Toolkit




The basic principles of HMM-based recognition were outlined in the previous chapter and a
number of the key HTK tools have already been mentioned. This chapter describes the software
architecture of a HTK tool. It then gives a brief outline of all the HTK tools and the way that
they are used together to construct and test HMM-based recognisers. For the benefit of existing
HTK users, the major changes in recent versions of HTK are listed. The following chapter will then
illustrate the use of the HTK toolkit by working through a practical example of building a simple
continuous speech recognition system.



2.1 HTK Software Architecture 






Much of the functionality of HTK is built into the library modules. These modules ensure that
every tool interfaces to the outside world in exactly the same way. They also provide a central
resource of commonly used functions. Fig. 2.1 illustrates the software structure of a typical HTK
tool and shows its input/output interfaces.

User input/output and interaction with the operating system is controlled by the library module
HShell and all memory management is controlled by HMem. Math support is provided by HMath
and the signal processing operations needed for speech analysis are in HSigP. Each of the file types
required by HTK has a dedicated interface module. HLabel provides the interface for label files,
HLM for language model files, HNet for networks and lattices, HDict for dictionaries, HVQ for
VQ codebooks and HModel for HMM definitions.

Fig. 2.1 Software Architecture

All speech input and output at the waveform level is via HWave and at the parameterised level
via HParm. As well as providing a consistent interface, HWave and HLabel support multiple
file formats allowing data to be imported from other systems. Direct audio input is supported
by HAudio and simple interactive graphics is provided by HGraf. HUtil provides a number of
utility routines for manipulating HMMs while HTrain and HFB contain support for the various
HTK training tools. HAdapt provides support for the various HTK adaptation tools. Finally,
HRec contains the main recognition processing functions.

As noted in the next section, fine control over the behaviour of these library modules is provided
by setting configuration variables. Detailed descriptions of the functions provided by the library
modules are given in the second part of this book and the relevant configuration variables are
described as they arise. For reference purposes, a complete list is given in chapter 18.


2.2 Generic Properties of a HTK Tool

HTK tools are designed to run with a traditional command-line style interface. Each tool has a
number of required arguments plus optional arguments. The latter are always prefixed by a minus
sign. As an example, the following command would invoke the mythical HTK tool called HFoo

HFoo -T 1 -f 34.3 -a -s myfile file1 file2

This tool has two main arguments called file1 and file2 plus four optional arguments. Options
are always introduced by a single letter option name followed where appropriate by the option value.
The option value is always separated from the option name by a space. Thus, the value of the -f
option is a real number, the value of the -T option is an integer number and the value of the -s
option is a string. The -a option has no following value and it is used as a simple flag to enable or
disable some feature of the tool. Options whose names are a capital letter have the same meaning
across all tools. For example, the -T option is always used to control the trace output of a HTK
tool.

In addition to command line arguments, the operation of a tool can be controlled by parameters
stored in a configuration file. For example, if the command

HFoo -C config -f 34.3 -a -s myfile file1 file2

is executed, the tool HFoo will load the parameters stored in the configuration file config during
its initialisation procedures. Multiple configuration files can be specified by repeating the -C option,
e.g.

HFoo -C config1 -C config2 -f 34.3 -a -s myfile file1 file2
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Configuration parameters can sometimes be used as an alternative to using command line argu- 
ments. For example, trace options can always be set within a configuration file. However, the main 
use of configuration files is to control the detailed behaviour of the library modules on which all 
HTK tools depend. 

Although this style of command-line working may seem old-fashioned when compared to modern 
graphical user interfaces, it has many advantages. In particular, it makes it simple to write shell 
scripts to control HTK tool execution. This is vital for performing large-scale system building 
and experimentation. Furthermore, defining all operations using text-based commands allows the 
details of system construction or experimental procedure to be recorded and documented. 
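As an illustration of scripted tool execution, the sketch below assembles a command line that follows the conventions described in this section. The helper `build_command` is hypothetical, and HFoo is the mythical tool used above, so the command is only printed here rather than run.

```python
import shlex


def build_command(tool, args, options=None, configs=()):
    """Assemble an HTK-style command line as a list of strings.

    options maps a single-letter option name to its value, or to None
    for a flag that takes no value. configs lists configuration files,
    each introduced by its own -C option.
    """
    cmd = [tool]
    for cfg in configs:
        cmd += ["-C", cfg]
    for name, value in (options or {}).items():
        cmd.append("-" + name)
        if value is not None:
            cmd.append(str(value))  # value is separated from the name by a space
    cmd += list(args)
    return cmd


cmd = build_command(
    "HFoo",
    ["file1", "file2"],
    options={"f": 34.3, "a": None, "s": "myfile"},
    configs=["config1", "config2"],
)
command_line = shlex.join(cmd)
print(command_line)
```

For a real tool, the list `cmd` could be passed to `subprocess.run`, and writing `command_line` to a log file keeps a record of exactly how each experiment was run.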

Finally, note that a summary of the command line and options for any HTK tool can be obtained
simply by executing the tool with no arguments.

2.3 The Toolkit 

The HTK tools are best introduced by going through the processing steps involved in building a
sub-word based continuous speech recogniser. As shown in Fig. 2.2, there are 4 main phases: data
preparation, training, testing and analysis.

2.3.1 Data Preparation Tools

In order to build a set of HMMs, a set of speech data files and their associated transcriptions are
required. Very often speech data will be obtained from database archives, typically on CD-ROMs.
Before it can be used in training, it must be converted into the appropriate parametric form and
any associated transcriptions must be converted to have the correct format and use the required
phone or word labels. If the speech needs to be recorded, then the tool HSLab can be used both
to record the speech and to manually annotate it with any required transcriptions.

Although all HTK tools can parameterise waveforms on-the-fly, in practice it is usually better to
parameterise the data just once. The tool HCopy is used for this. As the name suggests, HCopy
is used to copy one or more source files to an output file. Normally, HCopy copies the whole file,
but a variety of mechanisms are provided for extracting segments of files and concatenating files.
By setting the appropriate configuration variables, all input files can be converted to parametric
form as they are read-in. Thus, simply copying each file in this manner performs the required
encoding. The tool HList can be used to check the contents of any speech file and since it can also
convert input on-the-fly, it can be used to check the results of any conversions before processing
large quantities of data. Transcriptions will also need preparing. Typically the labels used in the
original source transcriptions will not be exactly as required, for example, because of differences in
the phone sets used. Also, HMM training might require the labels to be context-dependent. The
tool HLEd is a script-driven label editor which is designed to make the required transformations
to label files. HLEd can also output files to a single Master Label File (MLF) which is usually
more convenient for subsequent processing. Finally on data preparation, HLStats can gather and
display statistics on label files and where required, HQuant can be used to build a VQ codebook
in preparation for building a discrete probability HMM system.



2.3.2 Training Tools 



The second step of system building is to define the topology required for each HMM by writing a
prototype definition. HTK allows HMMs to be built with any desired topology. HMM definitions
can be stored externally as simple text files and hence it is possible to edit them with any convenient
text editor. Alternatively, the standard HTK distribution includes a number of example HMM
prototypes and a script to generate the most common topologies automatically. With the exception
of the transition probabilities, all of the HMM parameters given in the prototype definition are
ignored. The purpose of the prototype definition is only to specify the overall characteristics and
topology of the HMM. The actual parameters will be computed later by the training tools. Sensible
values for the transition probabilities must be given but the training process is very insensitive
to these. An acceptable and simple strategy for choosing these probabilities is to make all of the
transitions out of any state equally likely.
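The equal-likelihood strategy can be sketched in a few lines. The example assumes a left-to-right topology whose first and last states are non-emitting entry and exit states; the function names are hypothetical and the matrix is built as a plain list of lists rather than in HTK's prototype file format.

```python
def left_to_right(num_states):
    """Boolean mask of allowed transitions for a left-to-right HMM.

    State 0 is taken to be a non-emitting entry state and the last state
    a non-emitting exit state (an assumption of this sketch).
    """
    allowed = [[False] * num_states for _ in range(num_states)]
    allowed[0][1] = True                    # entry state moves straight on
    for i in range(1, num_states - 1):
        allowed[i][i] = True                # self-loop
        allowed[i][i + 1] = True            # move to the next state
    return allowed


def equal_transitions(allowed):
    """Give every allowed transition out of a state the same probability."""
    matrix = []
    for row in allowed:
        count = sum(1 for flag in row if flag)
        if count == 0:
            matrix.append([0.0] * len(row))  # exit state: no outgoing transitions
        else:
            matrix.append([1.0 / count if flag else 0.0 for flag in row])
    return matrix


trans = equal_transitions(left_to_right(5))
for row in trans:
    print(row)
```

Each emitting state ends up with probability 0.5 on its self-loop and 0.5 on the forward transition, which is adequate as a starting point since training re-estimates these values.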
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Fig. 2.2 "^yK Processing Stages 



The actual training process takes place in stages and it is illustrated in more detail in Fig. 2.3.
Firstly, an initial set of models must be created. If there is some speech data available for which
the location of the sub-word (i.e. phone) boundaries have been marked, then this can be used as
bootstrap data. In this case, the tools HInit and HRest provide isolated word style training using
the fully labelled bootstrap data. Each of the required HMMs is generated individually. HInit
reads in all of the bootstrap training data and cuts out all of the examples of the required phone. It
then iteratively computes an initial set of parameter values using a segmental k-means procedure.
On the first cycle, the training data is uniformly segmented, each model state is matched with the
corresponding data segments and then means and variances are estimated. If mixture Gaussian
models are being trained, then a modified form of k-means clustering is used. On the second
and successive cycles, the uniform segmentation is replaced by Viterbi alignment. The initial
parameter values computed by HInit are then further re-estimated by HRest. Again, the fully
labelled bootstrap data is used but this time the segmental k-means procedure is replaced by the
Baum-Welch re-estimation procedure described in the previous chapter. When no bootstrap data
is available, a so-called flat start can be used. In this case all of the phone models are initialised
to be identical and have state means and variances equal to the global speech mean and variance.
The tool HCompV can be used for this.
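The idea behind a flat start can be shown with a small sketch: compute one global mean and variance over all feature vectors, then copy them into every state of every phone model. The dictionary layout, the three-state default and the toy two-dimensional frames are assumptions for illustration; this is not how HCompV stores its output.

```python
def global_mean_variance(frames):
    """Per-dimension mean and variance over all training frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def flat_start(phones, frames, num_states=3):
    """Initialise every state of every phone model identically."""
    mean, var = global_mean_variance(frames)
    return {
        phone: [{"mean": list(mean), "var": list(var)} for _ in range(num_states)]
        for phone in phones
    }


# Toy two-dimensional "speech" frames; real features would be e.g. cepstra.
frames = [[1.0, 2.0], [3.0, 6.0]]
mean, var = global_mean_variance(frames)
models = flat_start(["sil", "ah"], frames)
print(mean, var)
```

Since all models start out identical, it is the subsequent embedded re-estimation that makes them diverge towards their respective phones.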
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Once an initial set of models has been created, the tool HERest is used to perform embedded 
training using the entire training set. HERest performs a single Baum-Welch re-estimation of the
whole set of HMM phone models simultaneously. For each training utterance, the corresponding
phone models are concatenated and then the forward-backward algorithm is used to accumulate the
statistics of state occupation, means, variances, etc., for each HMM in the sequence. When all of
the training data has been processed, the accumulated statistics are used to compute re-estimates
of the HMM parameters. HERest is the core HTK training tool. It is designed to process large
databases, it has facilities for pruning to reduce computation and it can be run in parallel across a
network of machines.

The philosophy of system construction in HTK is that HMMs should be refined incrementally.
Thus, a typical progression is to start with a simple set of single Gaussian context-independent
phone models and then iteratively refine them by expanding them to include context-dependency
and use multiple mixture component Gaussian distributions. The tool HHEd is a HMM definition
editor which will clone models into context-dependent sets, apply a variety of parameter tyings and
increment the number of mixture components in specified distributions. The usual process is to
modify a set of HMMs in stages using HHEd and then re-estimate the parameters of the modified
set using HERest after each stage. To improve performance for specific speakers the tools HERest
and HVite can be used to adapt HMMs to better model the characteristics of particular speakers
using a small amount of training or adaptation data. The end result of which is a speaker adapted
system.

The single biggest problem in building context-dependent HMM systems is always data
insufficiency. The more complex the model set, the more data is needed to make robust estimates of its
parameters, and since data is usually limited, a balance must be struck between complexity and
the available data. For continuous density systems, this balance is achieved by tying parameters
together as mentioned above. Parameter tying allows data to be pooled so that the shared
parameters can be robustly estimated. In addition to continuous density systems, HTK also supports
fully tied mixture systems and discrete probability systems. In these cases, the data insufficiency
problem is usually addressed by smoothing the distributions and the tool HSmooth is used for
this.



2.3.3 Recognition Tools 

HTK provides a recognition tool called HVite that allows recognition using language models and
lattices. HLRescore is a tool that allows lattices generated using HVite (or HDecode) to be
manipulated, for example to apply a more complex language model. An additional recogniser,
HDecode, is also available as an extension to HTK. Note: HDecode is distributed under a more
restrictive licence agreement.

HVite

HTK provides a recognition tool called HVite which uses the token passing algorithm described in
the previous chapter to perform Viterbi-based speech recognition. HVite takes as input a network
describing the allowable word sequences, a dictionary defining how each word is pronounced and a
set of HMMs. It operates by converting the word network to a phone network and then attaching
the appropriate HMM definition to each phone instance. Recognition can then be performed on
either a list of stored speech files or on direct audio input. As noted at the end of the last chapter,
HVite can support cross-word triphones and it can run with multiple tokens to generate lattices
containing multiple hypotheses. It can also be configured to rescore lattices and perform forced
alignments.

The word networks needed to drive HVite are usually either simple word loops in which any
word can follow any other word or they are directed graphs representing a finite-state task grammar.
In the former case, bigram probabilities are normally attached to the word transitions. Word
networks are stored using the HTK standard lattice format. This is a text-based format and hence
word networks can be created directly using a text-editor. However, this is rather tedious and hence
HTK provides two tools to assist in creating word networks. Firstly, HBuild allows sub-networks
to be created and used within higher level networks. Hence, although the same low level notation is
used, much duplication is avoided. Also, HBuild can be used to generate word loops and it can also
read in a backed-off bigram language model and modify the word loop transitions to incorporate
the bigram probabilities. Note that the label statistics tool HLStats mentioned earlier can be used
to generate a backed-off bigram language model.

As an alternative to specifying a word network directly, a higher level grammar notation can
be used. This notation is based on the Extended Backus-Naur Form (EBNF) used in compiler
specification and it is compatible with the grammar specification language used in earlier versions
of HTK. The tool HParse is supplied to convert this notation into the equivalent word network.

Whichever method is chosen to generate a word network, it is useful to be able to see examples
of the language that it defines. The tool HSGen is provided to do this. It takes as input a
network and then randomly traverses the network outputting word strings. These strings can then
be inspected to ensure that they correspond to what is required. HSGen can also compute the
empirical perplexity of the task.
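Random traversal of a word network can be sketched as follows. The tiny network and the function name are invented for illustration, and for brevity each node is identified by its word, whereas a real word network distinguishes separate instances of the same word. The sketch shows only the sampling idea and does not compute perplexity.

```python
import random

# Hypothetical word network: each node lists the nodes that may follow it.
network = {
    "SENT-START": ["DIAL", "CALL"],
    "DIAL": ["ONE", "TWO"],
    "ONE": ["ONE", "TWO", "SENT-END"],
    "TWO": ["ONE", "TWO", "SENT-END"],
    "CALL": ["YOUNG"],
    "YOUNG": ["SENT-END"],
    "SENT-END": [],
}


def random_sentence(net, rng, start="SENT-START", end="SENT-END"):
    """Walk the network from start to end, picking successors at random."""
    words = [start]
    while words[-1] != end:
        words.append(rng.choice(net[words[-1]]))
    return words


rng = random.Random(0)  # fixed seed so that runs are repeatable
samples = [random_sentence(network, rng) for _ in range(5)]
for sentence in samples:
    print(" ".join(sentence))
```

Reading a handful of such strings is often the quickest way to spot a grammar that allows word sequences it should not.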

Finally, the construction of large dictionaries can involve merging several sources and performing
a variety of transformations on each source. The dictionary management tool HDMan is supplied
to assist with this process.


HLRescore 

HLRescore is a tool for manipulating lattices. It reads lattices in standard lattice format (for
example produced by HVite) and applies one of the following operations on them:

• finding 1-best path through lattice: this allows language model scale factors and insertion
penalties to be optimised rapidly;

• expanding lattices with new language model: allows the application of more complex language
models, e.g. 4-grams, than can be efficiently used on the decoder;

• converting lattices to equivalent word networks: this is necessary prior to using lattices
generated with HVite (or HDecode) to merge duplicate paths;

• calculating various lattice statistics;

• pruning lattice using forward-backward scores: efficient pruning of lattices;

• converting word MLF files to lattices with a language model: this is necessary for generating
numerator lattices for discriminative training.

HLRescore expects lattices which are directed acyclic graphs (DAGs). If cycles occur in the 
lattices then HLRescore will throw an error. These cycles may occur after the merging operation 
(-m option) with HLRescore. 

HDecode

HDecode is a decoder suited for large vocabulary speech recognition and lattice generation, that
is available as an extension to HTK, distributed under a slightly more restrictive licence. Similar
to HVite, HDecode transcribes speech files using a HMM model set and a dictionary
(vocabulary). The best transcription hypothesis will be generated in the Master Label File (MLF) format.
Optionally, multiple hypotheses can also be generated as a word lattice in the form of the HTK
Standard Lattice Format (SLF).

The search space of the recognition process is defined by a model based network, produced from
expanding a supplied language model or a word level lattice using the dictionary. In the absence of
a word lattice, a language model must be supplied to perform a full decoding. The current version
of HDecode only supports trigram and bigram full decoding. When a word lattice is supplied, the
use of a language model is optional. This mode of operation is known as lattice rescoring.

HDecode expects lattices where there are no duplicates of word paths. However, by default
lattices that are generated by HDecode contain duplicates due to multiple pronunciations and
optional inter-word silence. To modify the lattices to be suitable for lattice rescoring, HLRescore
should be used to merge (using the -m option) multiple paths. Note that, as a side-effect of this, merged
lattices may not be DAGs (cycles may exist), thus merged lattices may not be suitable for applying
more complex LMs (using for example HLRescore).

The current implementation of HDecode has a number of limitations for use as a general
decoder for HMMs. It has primarily been developed for speech recognition. Limitations include:

• only works for cross-word triphones;

• sil and sp models are reserved as silence models and are, by default, automatically added to
the end of all "words" in the pronunciation dictionary;

• lattices generated with HDecode must be merged to remove duplicate word paths prior to
being used for lattice rescoring with HDecode and HVite.

2.3.4 Analysis Tool 

Once the HMM-based recogniser has been built, it is necessary to evaluate its performance. This
is usually done by using it to transcribe some pre-recorded test sentences and match the recogniser
output with the correct reference transcriptions. This comparison is performed by a tool called
HResults which uses dynamic programming to align the two transcriptions and then count
substitution, deletion and insertion errors. Options are provided to ensure that the algorithms and
output formats used by HResults are compatible with those used by the US National Institute
of Standards and Technology (NIST). As well as global performance measures, HResults can
also provide speaker-by-speaker breakdowns, confusion matrices and time-aligned transcriptions.
For word spotting applications, it can also compute Figure of Merit (FOM) scores and Receiver
Operating Curve (ROC) information.
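The alignment step can be illustrated with a minimal dynamic programming sketch that counts the three error types. It assumes every error has the same unit cost, which is a simplification chosen here and not necessarily the weighting HResults applies; the function name is hypothetical.

```python
def align_counts(ref, hyp):
    """Align two word lists and return (substitutions, deletions, insertions).

    Each cell holds (total_errors, subs, dels, ins) for the best alignment
    of ref[:i] with hyp[:j], assuming unit cost per error.
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)       # all reference words deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)       # all hypothesis words inserted
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)             # correct word
            else:
                diagonal = (e + 1, s + 1, d, n)     # substitution
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            cost[i][j] = min(diagonal, deletion, insertion)
    _, subs, dels, ins = cost[-1][-1]
    return subs, dels, ins


ref = "DIAL THREE TWO SIX".split()
hyp = "DIAL THREE FIVE SIX OH".split()
subs, dels, ins = align_counts(ref, hyp)
print(subs, dels, ins)
```

Here the hypothesis replaces TWO with FIVE and appends OH, so the alignment reports one substitution, no deletions and one insertion.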



2.4 What's New In Version 3.4 

This section lists the new features in HTK Version 3.4 compared to the preceding Version 3.3. 

1. HMMIRest has now been added as a tool for performing discriminative training. This sup- 
ports both Minimum Phone Error (MPE) and Maximum Mutual Information (MMI) training. 
To support this additional library modules for performing the forward-backward algorithm 
on lattices, and the ability to mark phone boundary times in lattices have been added. 
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2. HDecode has now been added as a tool for performing large vocabulary decoding. See 
section 17.6 for further details of limitations associated with this tool. 

3. HERest has been extended to support estimating semitied and HLDA transformations. 

4. Compilation issues have now been dealt with. 

5. Many other smaller changes and bug fixes have been integrated. 

2.4.1 New In Version 3.3

This section lists the new features in HTK Version 3.3 compared to the preceding Version 3.2.

1. HERest now incorporates the adaptation transform generation that was previously
performed in HEAdapt. The range of linear transformations and the ability to combine
transforms hierarchically has now been included. The system also now supports adaptive training
with constrained MLLR transforms.

2.4.2 New In Version 3.2

This section lists the new features in Version 3.2 compared to the preceding Version 3.1. 



1. The HLM toolkit has been incorporated into HTK. It supports the training and testing of
word or class-based n-gram language models.

2. HParm supports global feature space transforms.

3. HParm now supports third differentials (ΔΔΔ parameters).

4. A new tool named HLRescore offers support for a number of lattice post-processing
operations such as lattice pruning, finding the 1-best path in a lattice and language model expansion
of lattices.

5. HERest supports 2-model re-estimation which allows the use of a separate alignment model
set in the Baum-Welch re-estimation.

6. The initialisation of the decision-tree state clustering in HHEd has been improved.

7. HHEd supports a number of new commands related to variance flooring and decreasing the
number of mixtures.

8. A major bug in the estimation of block-diagonal MLLR transforms has been fixed.

9. Many other smaller changes and bug fixes have been integrated.

2.4.3 New In Version 3.1 

This section lists the new features in HTK Version 3.1 compared to the preceding Version 3.0 which 
was functionally equivalent to Version 2.2. 

1. HParm supports Perceptual Linear Prediction (PLP) feature extraction.

2. HParm supports Vocal Tract Length Normalisation (VTLN) by warping the frequency axis
in the filterbank analysis.

3. HParm supports variance scaling.

4. HParm supports cluster-based cepstral mean and variance normalisation.

5. All tools support an extended filename syntax that can be used to deal with unsegmented 
data more easily. 
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2.4.4 New In Version 2.2 

This section lists the new features and refinements in HTK Version 2.2 compared to the preceding
Version 2.1.

1. Speaker adaptation is now supported via the HEAdapt and HVite tools, which adapt a 
current set of models to a new speaker and/or environment. 

• HEAdapt performs offline supervised adaptation using maximum likelihood linear re- 
gression (MLLR) and/or maximum a-posteriori (MAP) adaptation. 

• HVite performs unsupervised adaptation using just MLLR.

Both tools can be used in a static mode, where all the data is presented prior to any adaptation,
or in an incremental fashion.

2. Improved support for PC WAV files

In addition to 16-bit PCM linear, HTK can now read

• 8-bit CCITT mu-law

• 8-bit CCITT a-law

• 8-bit PCM linear

2.4.5 Features Added To Version 2.1

For the benefit of users of earlier versions of HTK this section lists the main changes in HTK
Version 2.1 compared to the preceding Version 2.0.

1. The speech input handling has been partially re-designed and a new energy-based speech/silence
detector has been incorporated into HParm. The detector is robust yet flexible and can be
configured through a number of configuration variables. Speech/silence detection can now be
performed on waveform files. The calibration of speech/silence detector parameters is now
accomplished by asking the user to speak an arbitrary sentence.

2. HParm now allows random noise signal to be added to waveform data via the configuration
parameter ADDDITHER. This prevents numerical overflows which can occur with artificially
created waveform data under some coding schemes.

3. HNet has been optimised for more efficient operation when performing forced alignments of
utterances using HVite. Further network optimisations tailored to biphone/triphone-based
phone recognition have also been incorporated.

4. HVite can now produce partial recognition hypothesis even when no tokens survive to the end
of the network. This is accomplished by setting the HRec configuration parameter FORCEOUT
to true.

5. Dictionary support has been extended to allow pronunciation probabilities to be associated
with different pronunciations of the same word. At the same time, HVite now allows the use
of a pronunciation scale factor during recognition.

6. HTK now provides consistent support for reading and writing of HTK binary files (waveforms,
binary MMFs, binary SLFs, HERest accumulators) across different machine architectures
incorporating automatic byte swapping. By default, all binary data files handled by the tools
are now written/read in big-endian (NONVAX) byte order. The default behavior can be changed
via the configuration parameters NATURALREADORDER and NATURALWRITEORDER.

7. HWave supports the reading of waveforms in Microsoft WAVE file format.

8. HAudio allows key-press control of live audio input.



Chapter 3 

A Tutorial Example of Using HTK




call Julian " 



dial 332654 " 



This final chapter of the tutorial part of the book will describe the construction of a recogniser
for simple voice dialling applications. This recogniser will be designed to recognise continuously
spoken digit strings and a limited set of names. It is sub-word based so that adding a new name to
the vocabulary involves only modification to the pronouncing dictionary and task grammar. The
HMMs will be continuous density mixture Gaussian tied-state triphones with clustering performed
using phonetic decision trees. Although the voice dialling task itself is quite simple, the system
design is general-purpose and would be useful for a range of applications.

The system will be built from scratch even to the extent of recording training and test data
using the HTK tool HSLab. To make this tractable, the system will be speaker dependent¹, but
the same design would be followed to build a speaker independent system. The only difference being
that data would be required from a large number of speakers and there would be a consequential
increase in model complexity.

Building a speech recogniser from scratch involves a number of inter-related subtasks and
pedagogically it is not obvious what the best order is to present them. In the presentation here, the
ordering is chronological so that in effect the text provides a recipe that could be followed to
construct a similar system. The entire process is described in considerable detail in order to give a clear
view of the range of functions that HTK addresses and thereby to motivate the rest of the book.

The HTK software distribution also contains an example of constructing a recognition system
for the 1000 word ARPA Naval Resource Management Task. This is contained in the directory
RMHTK of the HTK distribution. Further demonstration of HTK's capabilities can be found in the
directory HTKDemo. Some example scripts that may be of assistance during the tutorial are available
in the HTKTutorial directory.

At each step of the tutorial presented in this chapter, the user is advised to thoroughly read
the entire section before executing the commands, and also to consult the reference section for
each HTK tool being introduced (chapter 17), so that all command line options and arguments are
clearly understood.

¹ The final stage of the tutorial deals with adapting the speaker dependent models for new speakers.
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3.1 Data Preparation 

The first stage of any recogniser development project is data preparation. Speech data is needed
both for training and for testing. In the system to be built here, all of this speech will be recorded
from scratch and to do this scripts are needed to prompt for each sentence. In the case of the
test data, these prompt scripts will also provide the reference transcriptions against which the
recogniser's performance can be measured and a convenient way to create them is to use the task
grammar as a random generator. In the case of the training data, the prompt scripts will be used in
conjunction with a pronunciation dictionary to provide the initial phone level transcriptions needed
to start the HMM training process. Since the application requires that arbitrary names can be
added to the recogniser, training data with good phonetic balance and coverage is needed. Here
for convenience the prompt scripts needed for training are taken from the TIMIT acoustic-phonetic
database.

It follows from the above that before the data can be recorded, a phone set must be defined,
a dictionary must be constructed to cover both training and testing and a task grammar must be
defined.

3.1.1 Step 1 - the Task Grammar

The goal of the system to be built here is to provide a voice-operated interface for phone dialling.
Thus, the recogniser must handle digit strings and also personal name lists. Examples of typical
inputs might be

Dial three three two six five four 
Dial nine zero four one oh nine 
Phone Woodland 
Call Steve Young 

...

HTK provides a grammar definition language for specifying simple task grammars such as this.
It consists of a set of variable definitions followed by a regular expression describing the words to
recognise. For the voice dialling application, a suitable grammar might be

$digit = ONE | TWO | THREE | FOUR | FIVE |
         SIX | SEVEN | EIGHT | NINE | OH | ZERO;
$name  = [ JOOP ] JANSEN |
         [ JULIAN ] ODELL |
         [ DAVE ] OLLASON |
         [ PHIL ] WOODLAND |
         [ STEVE ] YOUNG;
( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name) SENT-END )

where the vertical bars denote alternatives, the square brackets denote optional items and the angle
braces denote one or more repetitions. The complete grammar can be depicted as a network as
shown in Fig. 3.1.



Fig. 3.1 Grammar for Voice Dialling

Fig. 3.2 Step 1 (output: Word Net, wdnet)



The above high level representation of a task grammar is provided for user convenience. The
HTK recogniser actually requires a word network to be defined using a low level notation called
HTK Standard Lattice Format (SLF) in which each word instance and each word-to-word transition
is listed explicitly. This word network can be created automatically from the grammar above using
the HParse tool. Thus, assuming that the file gram contains the above grammar, executing

HParse gram wdnet



will create an equivalent word network in the file wdnet (see Fig 3.2). 
3.1.2 Step 2 - the Dictionary 






The first step in building a dictionary is to create a sorted list of the required words. In the telephone 
dialling task pursued here, it is quite easy to create a list of required words by hand. However, if 
the task were more complex, it would be necessary to build a word list from the sample sentences 
present in the training data. Furthermore, to build robust acoustic models, it is necessary to train 
them on a large set of sentences containing many words and preferably phonetically balanced. For 
these reasons, the training data will consist of English sentences unrelated to the phone recognition 
task. Below, a short example of creating a word list from sentence prompts will be given. As noted 
above the training sentences given here are extracted from some prompts used with the TIMIT 
database and for convenience reasons they have been renumbered. For example, the first few items 
might be as follows

S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS
S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENT
S0003 BOTH FIGURES WOULD GO HIGHER IN LATER YEARS
S0004 THIS IS NOT A PROGRAM OF SOCIALIZED MEDICINE
etc

The desired training word list (wlist) could then be extracted automatically from these. Before
using HTK, one would need to edit the text into a suitable format. For example, it would be
necessary to change all white space to newlines and then to use the UNIX utilities sort and uniq
to sort the words into a unique alphabetically ordered set, with one word per line. The script
prompts2wlist from the HTKTutorial directory can be used for this purpose.
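
The word-list extraction described above can be sketched in a few lines of Python. This is an
illustration only, not the prompts2wlist script itself: the function name prompts_to_wlist is
hypothetical, and it assumes that every prompt line starts with a label (such as S0001) followed
by the words of the sentence.

```python
def prompts_to_wlist(prompt_lines):
    """Return the sorted, de-duplicated words found in numbered prompt lines."""
    words = set()
    for line in prompt_lines:
        tokens = line.split()
        # ASSUMPTION: the first token is the prompt label (e.g. S0001), not a word.
        words.update(tokens[1:])
    # sorted() orders by code point, like the UNIX sort utility in the C locale.
    return sorted(words)


prompts = [
    "S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS",
    "S0002 TWO OTHER CASES ALSO WERE UNDER ADVISEMENT",
]
for word in prompts_to_wlist(prompts):
    print(word)  # one word per line, as required for wlist
```

The set plays the role of uniq and sorted() the role of sort; splitting on white space
replaces the step that turns all white space into newlines.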

The dictionary itself can be built from a standard source using HDMan. For this example, the
British English BEEP pronouncing dictionary will be used¹. Its phone set will be adopted without
modification except that the stress marks will be removed and a short-pause (sp) will be added to
the end of every pronunciation. If the dictionary contains any silence markers then the MP command
will merge the sil and sp phones into a single sil. These changes can be applied using HDMan
and an edit script (stored in global.ded) containing the three commands



AS sp
RS cmu
MP sil sil sp

where cmu refers to a style of stress marking in which the lexical stress level is marked by a single
digit appended to the phone name (e.g. eh2 means the phone eh with level 2 stress).



Fig. 3.3 Step 2

The command

HDMan -m -w wlist -n monophones1 -l dlog dict beep names



will create a new dictionary called dict by searching the source dictionaries beep and names to find
pronunciations for each word in wlist (see Fig 3.3). Here, the wlist in question needs only to be
a sorted list of the words appearing in the task grammar given above.

Note that names is a manually constructed file containing pronunciations for the proper names
used in the task grammar. The option -l instructs HDMan to output a log file dlog which
contains various statistics about the constructed dictionary. In particular, it indicates if there are
words missing. HDMan can also output a list of the phones used, here called monophones1. Once
training and test data has been recorded, an HMM will be estimated for each of these phones.

The general format of each dictionary entry is

WORD [outsym] p1 p2 p3 ....



¹ Available by anonymous ftp from svr-ftp.eng.cam.ac.uk/pub/comp.speech/dictionaries/beep.tar.gz. Note
that items beginning with unmatched quotes, found at the start of the dictionary, should be removed.

which means that the word WORD is pronounced as the sequence of phones p1 p2 p3 .... The
string in square brackets specifies the string to output when that word is recognised. If it is omitted
then the word itself is output. If it is included but empty, then nothing is output.
To see what the dictionary is like, here are a few entries.




Notice that function words such as A and TO have multiple pronunciations. The entries for SENT-START
and SENT-END have a silence model sil as their pronunciations and null output symbols.




3.1.3 Step 3 - Recording the Data

The training and test data will be recorded using the HTK tool HSLab. This is a combined
waveform recording and labelling tool. In this example HSLab will be used just for recording, as
labels already exist. However, if you do not have pre-existing training sentences (such as those from
the TIMIT database) you can create them either from pre-existing text (as described above) or by
labelling your training utterances using HSLab. HSLab is invoked by typing

HSLab noname

This will cause a window to appear with a waveform display area in the upper half and a row
of buttons, including a record button, in the lower half. When the name of a normal file is given
as argument, HSLab displays its contents. Here, the special file name noname indicates that new
data is to be recorded. HSLab makes no special provision for prompting the user. However, each
time the record button is pressed, it writes the subsequent recording alternately to a file called
noname_0. and to a file called noname_1. . Thus, it is simple to write a shell script which for each
successive line of a prompt file, outputs the prompt, waits for either noname_0. or noname_1. to
appear, and then renames the file to the name prepending the prompt (see Fig. 3.4).
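
The prompting loop just described might look roughly as follows. The text suggests a shell
script; Python is used here only for illustration. The function name collect_recordings, the
polling interval, the optional timeout and the .wav extension given to the renamed files are
all assumptions, and the sketch does not check that HSLab has finished writing a file before
renaming it.

```python
import os
import time


def collect_recordings(prompt_lines, workdir=".", poll=0.2, timeout=None):
    """Show each prompt, wait for HSLab's next output file, then rename it."""
    # HSLab alternates between these two file names (note the trailing dot).
    candidates = [os.path.join(workdir, n) for n in ("noname_0.", "noname_1.")]
    renamed = []
    for line in prompt_lines:
        # ASSUMPTION: each line is "<label> <words...>", e.g. "S0001 ONE VALIDATED".
        label, _, text = line.strip().partition(" ")
        print(f"Please say: {text}")
        start = time.monotonic()
        found = None
        while found is None:
            found = next((p for p in candidates if os.path.exists(p)), None)
            if found is None:
                if timeout is not None and time.monotonic() - start > timeout:
                    raise TimeoutError(f"no recording appeared for {label}")
                time.sleep(poll)
        # ASSUMPTION: recordings are stored as <label>.wav.
        target = os.path.join(workdir, label + ".wav")
        os.rename(found, target)
        renamed.append(target)
    return renamed
```

Because each recording is renamed as soon as it appears, it does not matter which of the two
noname files HSLab writes next.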

While the prompts for training sentences already were provided for above, the prompts for test
sentences need to be generated before recording them. The tool HSGen can be used to do this by
randomly traversing a word network and outputting each word encountered. For example, typing

HSGen -l -n 200 wdnet dict > testprompts

would generate 200 numbered test utterances, the first few of which would look something like:

1. PHONE YOUNG
2. DIAL OH SIX SEVEN SEVEN OH ZERO
3. DIAL SEVEN NINE OH OH EIGHT SEVEN NINE NINE
4. DIAL SIX NINE SIX TWO NINE FOUR ZERO NINE EIGHT
5. CALL JULIAN ODELL
... etc



These can be piped to construct the prompt file testprompts for the required test data.

Fig. 3.4 Step 3

3.1.4 Step 4 - Creating the Transcription Files



To train a set of HMMs, every file of training data must have an associated phone level tran-
scription. Since there is no hand labelled data to bootstrap a set of models, a flat-start scheme will
be used instead. To do this, two sets of phone transcriptions will be needed. The set used initially
will have no short-pause (sp) models between words. Then once reasonable phone models have
been generated, an sp model will be inserted between words to take care of any pauses introduced
by the speaker.

The starting point for both sets of phone transcription is an orthographic transcription in HTK
label format. This can be created fairly easily using a text editor or a scripting language. An
example of this is found in the RM Demo at point 0.4. Alternatively, the script prompts2mlf has
been provided in the HTKTutorial directory. The effect should be to convert the prompt utterances
shown above into the following form:



#!MLF!#
"*/S0001.lab"
ONE
VALIDATED
ACTS
OF
SCHOOL
DISTRICTS
.
"*/S0002.lab"
TWO
OTHER
CASES
ALSO
WERE
UNDER
ADVISEMENT
.
"*/S0003.lab"
BOTH
FIGURES
(etc.)



As can be seen, the prompt labels need to be converted into path names, each word should be
written on a single line and each utterance should be terminated by a single period on its own.
The first line of the file just identifies the file as a Master Label File (MLF). This is a single file
containing a complete set of transcriptions. HTK allows each individual transcription to be stored
in its own file but it is more efficient to use an MLF.
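
The conversion from prompt lines to a word level MLF can be sketched as follows. This is not the
prompts2mlf script itself, only a minimal illustration of the three rules just listed; the function
name prompts_to_mlf is hypothetical and each prompt line is assumed to begin with its label.

```python
def prompts_to_mlf(prompt_lines):
    """Build the text of a word level MLF from numbered prompt lines."""
    out = ["#!MLF!#"]                      # header identifying a Master Label File
    for line in prompt_lines:
        tokens = line.split()
        if not tokens:
            continue                       # skip blank lines
        label, words = tokens[0], tokens[1:]
        out.append(f'"*/{label}.lab"')     # prompt label becomes a path pattern
        out.extend(words)                  # one word per line
        out.append(".")                    # a single period ends the utterance
    return "\n".join(out) + "\n"


print(prompts_to_mlf(["S0001 ONE VALIDATED ACTS OF SCHOOL DISTRICTS"]), end="")
```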

The form of the path name used in the MLF deserves some explanation since it is really a pattern
and not a name. When HTK processes speech files, it expects to find a transcription (or label file)
with the same name but a different extension. Thus, if the file /root/sjy/data/S0001.wav was
being processed, HTK would look for a label file called /root/sjy/data/S0001.lab. When MLF
files are used, HTK scans the file for a pattern which matches the required label file name. However,
an asterisk will match any character string and hence the pattern used in the example is in effect
path independent. It therefore allows the same transcriptions to be used with different versions of
the speech data stored in different locations.
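
The effect of the asterisk can be illustrated with Python's fnmatch module, used here only as a
stand-in for HTK's own pattern matcher (the two differ in details, for example in how brackets
are treated). The function name find_transcription is hypothetical.

```python
from fnmatch import fnmatchcase


def find_transcription(mlf_patterns, label_file):
    """Return the first MLF pattern matching the required label file name."""
    for pattern in mlf_patterns:
        if fnmatchcase(label_file, pattern):
            return pattern
    return None


patterns = ["*/S0001.lab", "*/S0002.lab"]
# The same pattern matches wherever the speech data happens to be stored.
print(find_transcription(patterns, "/root/sjy/data/S0001.lab"))
print(find_transcription(patterns, "/some/other/place/S0001.lab"))
```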

Once the word level MLF has been created, phone level MLFs can be generated using the label
editor HLEd. For example, assuming that the above word level MLF is stored in the file words.mlf,
the command

HLEd -l '*' -d dict -i phones0.mlf mkphones0.led words.mlf

will generate a phone level transcription of the following form where the -l option is needed to
generate the path '*' in the output patterns.

#!MLF!#
"*/S0001.lab"
sil
w
ah
n
v
ae
l
ih
d
.. etc

This process is illustrated in Fig. 3.5.

The HLEd edit script mkphones0.led contains the following commands

EX
IS sil sil
DE sp

The expand EX command replaces each word in words.mlf by the corresponding pronunciation in
the dictionary file dict. The IS command inserts a silence model sil at the start and end of every
utterance. Finally, the delete DE command deletes all short-pause sp labels, which are not wanted
in the transcription labels at this point.
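
The combined effect of the three edit commands can be mimicked on a single utterance as shown
below. This is a sketch of the behaviour described above, not of HLEd's implementation; the
function name and the two-word dictionary are made up for the example, and the first
pronunciation of each word is taken.

```python
def make_phone_transcription(words, dictionary):
    """Mimic EX, IS sil sil and DE sp on one utterance."""
    phones = []
    for word in words:
        # EX: replace the word by its pronunciation (first entry if several).
        phones.extend(dictionary[word][0])
    # IS sil sil: insert sil at the start and end of the utterance.
    phones = ["sil"] + phones + ["sil"]
    # DE sp: delete every short-pause label.
    return [p for p in phones if p != "sp"]


# Hypothetical toy dictionary; every pronunciation ends in sp, as produced by AS sp.
toy_dict = {
    "ONE": [["w", "ah", "n", "sp"]],
    "TWO": [["t", "uw", "sp"]],
}
print(" ".join(make_phone_transcription(["ONE", "TWO"], toy_dict)))
```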




Fig. 3.5 Step 4 (TIMIT Prompts; Word Level Transcription, words.mlf; Edit Script, mkphones0.led;
Phone Level Transcription, phones0.mlf)



3.1.5 Step 5 - Coding the Data 

The final stage of data preparation is to parameterise the raw speech waveforms into sequences
of feature vectors. HTK supports both FFT-based and LPC-based analysis. Here Mel Frequency
Cepstral Coefficients (MFCCs), which are derived from FFT-based log spectra, will be used.

Coding can be performed using the tool HCopy configured to automatically convert its input
into MFCC vectors. To do this, a configuration file (config) is needed which specifies all of the
conversion parameters. Reasonable settings for these are as follows

# Coding parameters
TARGETKIND = MFCC_0
TARGETRATE = 100000.0
SAVECOMPRESSED = T
SAVEWITHCRC = T
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F



Some of these settings are in fact the default setting, but they are given explicitly here for
completeness. In brief, they specify that the target parameters are to be MFCC using C0 as the energy
component, the frame period is 10msec (HTK uses units of 100ns), the output should be saved in
compressed format, and a crc checksum should be added. The FFT should use a Hamming window
and the signal should have first order preemphasis applied using a coefficient of 0.97. The filterbank
should have 26 channels and 12 MFCC coefficients should be output. The variable ENORMALISE is
by default true and performs energy normalisation on recorded audio files. It cannot be used with
live audio and since the target system is for live audio, this variable should be set to false.
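
Since HTK expresses times in units of 100ns, the TARGETRATE and WINDOWSIZE values are easier
to read once converted. The helper names below are hypothetical, and the 16 kHz sampling rate
used in the last line is an assumption made purely for illustration; the text does not state the
sampling rate of the recordings.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # one HTK time unit is 100 ns


def htk_to_ms(value):
    """Convert a duration in HTK units (100 ns) to milliseconds."""
    return value * 1000.0 / HTK_UNITS_PER_SECOND


def htk_to_samples(value, sample_rate_hz):
    """Convert a duration in HTK units to a whole number of samples."""
    return round(value * sample_rate_hz / HTK_UNITS_PER_SECOND)


print(htk_to_ms(100000.0))               # TARGETRATE: frame period in ms
print(htk_to_ms(250000.0))               # WINDOWSIZE: analysis window in ms
print(htk_to_samples(250000.0, 16000))   # window length in samples at an assumed 16 kHz
```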

Note that explicitly creating coded data files is not necessary, as coding can be done "on-the-fly"
from the original waveform files by specifying the appropriate configuration file (as above) with the 
relevant HTK tools. However, creating these files reduces the amount of preprocessing required 
during training, which itself can be a time-consuming process. 

To run HCopy, a list of each source file and its corresponding output file is needed. For example,
the first few lines might look like

/root/sjy/waves/S0001.wav /root/sjy/train/S0001.mfc
/root/sjy/waves/S0002.wav /root/sjy/train/S0002.mfc
/root/sjy/waves/S0003.wav /root/sjy/train/S0003.mfc
/root/sjy/waves/S0004.wav /root/sjy/train/S0004.mfc
(etc.)

Files containing lists of files are referred to as script files² and by convention are given the extension
scp (although HTK does not demand this). Script files are specified using the standard -S option
and their contents are read simply as extensions to the command line. Thus, they avoid the need
for command lines with several thousand arguments³.
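
A script file of this kind is tedious to type by hand but easy to generate. The sketch below builds
the source/target pairs for a list of utterance labels; the function name make_coding_script is
hypothetical, and the directory names are simply those of the example above.

```python
import posixpath


def make_coding_script(labels, wave_dir, coded_dir):
    """Return script-file text pairing each waveform with its coded output file."""
    lines = []
    for label in labels:
        src = posixpath.join(wave_dir, label + ".wav")
        dst = posixpath.join(coded_dir, label + ".mfc")
        lines.append(f"{src} {dst}")
    return "\n".join(lines) + "\n"


labels = [f"S{n:04d}" for n in range(1, 5)]   # S0001 .. S0004
print(make_coding_script(labels, "/root/sjy/waves", "/root/sjy/train"), end="")
```

The returned text would be written to a file such as codetr.scp and passed to the tool with -S.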



Fig. 3.6 Step 5

Assuming that the above script is stored in the file codetr.scp, the training data would be coded
by executing

HCopy -T 1 -C config -S codetr.scp

This is illustrated in Fig. 3.6. A similar procedure is used to code the test data (using
TARGETKIND = MFCC_0_D_A in config) after which all of the pieces are in place to start training
the HMMs.






3.2 Creating Monophone HMMs 

In this section, the creation of a well-trained set of single-Gaussian monophone HMMs will be
described. The starting point will be a set of identical monophone HMMs in which every mean and
variance is identical. These are then retrained, short-pause models are added and the silence model
is extended slightly. The monophones are then retrained.

Some of the dictionary entries have multiple pronunciations. However, when HLEd was used
to expand the word level MLF to create the phone level MLFs, it arbitrarily selected the first
pronunciation it found. Once reasonable monophone HMMs have been created, the recogniser tool
HVite can be used to perform a forced alignment of the training data. By this means, a new phone
level MLF is created in which the choice of pronunciations depends on the acoustic evidence. This
new MLF can be used to perform a final re-estimation of the monophone HMMs.



3.2.1 Step 6 - Creating Flat Start Monophones 



The first step in HMM training is to define a prototype model. The parameters of this model
are not important; its purpose is to define the model topology. For phone-based systems, a good
topology to use is 3-state left-right with no skips such as the following

~o <VecSize> 39 <MFCC_0_D_A>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 39
0.0 0.0 0.0 ...
<Variance> 39
1.0 1.0 1.0 ...
<State> 3
<Mean> 39
0.0 0.0 0.0 ...
<Variance> 39
1.0 1.0 1.0 ...
<State> 4
<Mean> 39
0.0 0.0 0.0 ...
<Variance> 39
1.0 1.0 1.0 ...
<TransP> 5
0.0 1.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0
<EndHMM>

where each ellipsed vector is of length 39. This number, 39, is computed from the length of the
parameterised static vector (MFCC_0 = 13) plus the delta coefficients (+13) plus the acceleration
coefficients (+13).

² Not to be confused with files containing edit scripts.
³ Most UNIX shells, especially the C shell, only allow a limited and quite small number of arguments.
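
Typing three pairs of 39-element vectors by hand is error-prone, so a prototype of this shape could
be generated instead. The following is only a sketch: the function name make_proto is hypothetical,
and the layout simply reproduces the prototype shown above with the ellipsed vectors written out
in full.

```python
STATIC = 12 + 1            # 12 MFCCs plus C0
VEC_SIZE = STATIC * 3      # statics + deltas + accelerations = 39


def make_proto(vec_size=VEC_SIZE, kind="MFCC_0_D_A", name="proto"):
    """Return the text of a 3-state left-right prototype HMM definition."""
    mean = " ".join(["0.0"] * vec_size)
    var = " ".join(["1.0"] * vec_size)
    lines = [f"~o <VecSize> {vec_size} <{kind}>", f'~h "{name}"',
             "<BeginHMM>", "<NumStates> 5"]
    for state in (2, 3, 4):                       # the three emitting states
        lines += [f"<State> {state}",
                  f"<Mean> {vec_size}", mean,
                  f"<Variance> {vec_size}", var]
    lines += ["<TransP> 5",
              "0.0 1.0 0.0 0.0 0.0",
              "0.0 0.6 0.4 0.0 0.0",
              "0.0 0.0 0.6 0.4 0.0",
              "0.0 0.0 0.0 0.7 0.3",
              "0.0 0.0 0.0 0.0 0.0",
              "<EndHMM>"]
    return "\n".join(lines) + "\n"


print(make_proto(), end="")
```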

The HTK tool HCompV will scan a set of data files, compute the global mean and variance and
set all of the Gaussians in a given HMM to have the same mean and variance. Hence, assuming
that a list of all the training files is stored in train.scp, the command

HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

will create a new version of proto in the directory hmm0 in which the zero means and unit variances
above have been replaced by the global speech means and variances. Note that the prototype
HMM defines the parameter kind as MFCC_0_D_A (Note: 'zero' not 'oh'). This means that delta
and acceleration coefficients are to be computed and appended to the static MFCC coefficients
computed and stored during the coding process described above. To ensure that these are computed
during loading, the configuration file config should be modified to change the target kind, i.e. the
configuration file entry for TARGETKIND should be changed to

TARGETKIND = MFCC_0_D_A
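
The statistics involved can be sketched on a toy data set. This is not HCompV's code: the function
names are hypothetical, the two-dimensional frames are invented, and the variance is assumed to be
the population variance (dividing by the number of frames). The factor 0.01 is the value passed
with -f in the command above.

```python
def global_stats(frames):
    """Return the per-dimension mean and variance over all frames."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # ASSUMPTION: population variance (divide by n, not n - 1).
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def variance_floor(var, scale=0.01):
    """Return a floor vector equal to scale times the global variance."""
    return [scale * v for v in var]


frames = [[1.0, 2.0], [3.0, 6.0]]       # two made-up 2-dimensional frames
mean, var = global_stats(frames)
print(mean, var, variance_floor(var))
```

Every Gaussian in the prototype would receive the same mean and variance vectors, which is why
the resulting models are described as a flat start.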

HCompV has a number of options specified for it. The -f option causes a variance floor macro
(called vFloors) to be generated which is equal to 0.01 times the global variance. This is a vector of
values which will be used to set a floor on the variances estimated in the subsequent steps. The -m
option asks for means to be computed as well as variances. Given this new prototype model stored
in the directory hmm0, a Master Macro File (MMF) called hmmdefs containing a copy for each of
the required monophone HMMs is constructed by manually copying the prototype and relabeling
it for each required monophone (including "sil"). The format of an MMF is similar to that of an
MLF and it serves a similar purpose in that it avoids having a large number of individual HMM
definition files (see Fig. 3.7).

macros

~o
<VecSize> 39
<MFCC_0_D_A>
~v "varFloor1"
<Variance> 39
0.0012 0.0003 ...

hmmdefs

~h "aa"
<BeginHMM>
...
<EndHMM>
~h "eh"
<BeginHMM>
...
<EndHMM>
... etc

Fig. 3.7 Form of Master Macro Files
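
The manual copy-and-relabel step that produces hmmdefs can also be scripted. The sketch below
assumes that the prototype file consists of a ~o header, a ~h "proto" line and then the model body;
the function name clone_prototype is hypothetical and the tiny prototype text is a placeholder
rather than a real model.

```python
def clone_prototype(proto_text, phones):
    """Return hmmdefs text holding one relabelled copy of the prototype per phone."""
    lines = proto_text.splitlines()
    # ASSUMPTION: everything after the ~h line is the model body to be copied.
    start = next(i for i, l in enumerate(lines) if l.strip().startswith("~h"))
    body = lines[start + 1:]
    out = []
    for phone in phones:
        out.append(f'~h "{phone}"')   # relabel the copy with the phone name
        out.extend(body)
    return "\n".join(out) + "\n"


proto = '~o <VecSize> 39 <MFCC_0_D_A>\n~h "proto"\n<BeginHMM>\n<EndHMM>\n'
print(clone_prototype(proto, ["aa", "eh", "sil"]), end="")
```

The ~o header is dropped from the copies because, in this tutorial, the global options are kept in
the separate macros file.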



The flat start monophones stored in the directory hmm0 are re-estimated using the embedded
re-estimation tool HERest invoked as follows

HERest -C config -I phones0.mlf -t 250.0 150.0 1000.0 \
    -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones0

The effect of this is to load all the models in hmm0 which are listed in the model list monophones0
(monophones1 less the short pause (sp) model). These are then re-estimated using the data
listed in train.scp and the new model set is stored in the directory hmm1. Most of the files used
in this invocation of HERest have already been described. The exception is the file macros. This
should contain a so-called global options macro and the variance floor macro vFloors generated
earlier. The global options macro simply defines the HMM parameter kind and the vector size, i.e.

~o <MFCC_0_D_A> <VecSize> 39

See Fig. 3.7. This can be combined with vFloors into a text file called macros.



Fig. 3.8 Step 6



The -t option sets the pruning thresholds to be used during training. Pruning limits the range of
state alignments that the forward-backward algorithm includes in its summation and it can reduce
the amount of computation required by an order of magnitude. For most training files, a very tight
pruning threshold can be set; however, some training files will provide poorer acoustic matching
and in consequence a wider pruning beam is needed. HERest deals with this by having an
auto-incrementing pruning threshold. In the above example, pruning is normally 250.0. If re-estimation
fails on any particular file, the threshold is increased by 150.0 and the file is reprocessed. This is
repeated until either the file is successfully processed or the pruning limit of 1000.0 is exceeded. At
this point it is safe to assume that there is a serious problem with the training file and hence the
fault should be fixed (typically it will be an incorrect transcription) or the training file should be
discarded. The process leading to the initial set of monophones in the directory hmm0 is illustrated
in Fig. 3.8.
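
The auto-incrementing threshold amounts to a simple retry loop. The sketch below is based only on
the description above, not on HERest's source; the function names are hypothetical, try_file
stands for whatever processes one file at a given beam and reports success, and a threshold equal
to the limit is assumed to be still allowed.

```python
def pruning_schedule(initial, increment, limit):
    """List the thresholds tried for one file, in order."""
    thresholds = []
    beam = initial
    while beam <= limit:          # ASSUMPTION: the limit itself may still be tried
        thresholds.append(beam)
        beam += increment
    return thresholds


def process_with_backoff(try_file, initial=250.0, increment=150.0, limit=1000.0):
    """Return the first threshold at which try_file succeeds, or None."""
    for beam in pruning_schedule(initial, increment, limit):
        if try_file(beam):
            return beam
    return None                   # the file should be fixed or discarded


print(pruning_schedule(250.0, 150.0, 1000.0))
```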

Each time HERest is run it performs a single re-estimation. Each new HMM set is stored in
a new directory. Execution of HERest should be repeated twice more, changing the name of the
input and output directories (set with the options -H and -M) each time, until the directory hmm3
contains the final set of initialised monophone HMMs.



3.2.2 Step 7 - Fixing the Silence Models




Fig. 3.9 Silence Models

The previous step has generated a 3 state left-to-right HMM for each phone and also a HMM
for the silence model sil. The next step is to add extra transitions from states 2 to 4 and from
states 4 to 2 in the silence model. The idea here is to make the model more robust by allowing
individual states to absorb the various impulsive noises in the training data. The backward skip
allows this to happen without committing the model to transit to the following word.

Also, at this point, a 1 state short pause sp model should be created. This should be a so-called
tee-model which has a direct transition from entry to exit node. This sp has its emitting state tied
to the centre state of the silence model. The required topology of the two silence models is shown
in Fig. 3.9.

These silence models can be created in two stages 

• Use a text editor on the file hmm3/hmmdefs to copy the centre state of the sil model to make
a new sp model and store the resulting MMF hmmdefs, which includes the new sp model, in
the new directory hmm4.

• Run the HMM editor HHEd to add the extra transitions required and tie the sp state to the
centre sil state

HHEd works in a similar way to HLEd. It applies a set of commands in a script to modify a 
set of HMMs. In this case, it is executed as follows 

HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones1

where sil.hed contains the following commands

AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}

The AT commands add transitions to the given transition matrices and the final TI command creates
a tied-state called silst. The parameters of this tied-state are stored in the hmmdefs file and within
each silence model, the original state parameters are replaced by the name of this macro. Macros are
described in more detail below. For now it is sufficient to regard them simply as the mechanism by
which HTK implements parameter sharing. Note that the phone list used here has been changed,
because the original list monophones0 has been extended by the new sp model. The new file is
called monophones1 and has been used in the above HHEd command.
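
To make the effect of an AT command concrete, the sketch below adds a transition to a 5-state
transition matrix. It assumes that the remaining transitions out of the source state are rescaled
so that the row still sums to one; that rescaling rule, the function name add_transition and the
use of the untrained prototype values are all assumptions made for illustration.

```python
def add_transition(trans, i, j, prob):
    """Return a copy of trans with a transition i -> j of probability prob.

    States are numbered from 1, as in the edit script. ASSUMPTION: the other
    transitions out of state i are rescaled so that the row sums to one.
    """
    row = trans[i - 1]
    rest = sum(p for k, p in enumerate(row) if k != j - 1)
    scale = (1.0 - prob) / rest if rest > 0.0 else 0.0
    new_row = [prob if k == j - 1 else p * scale for k, p in enumerate(row)]
    return [new_row if r == i - 1 else list(old) for r, old in enumerate(trans)]


# Transition matrix of the prototype (illustrative, untrained values).
sil = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
sil = add_transition(sil, 2, 4, 0.2)   # AT 2 4 0.2 {sil.transP}
sil = add_transition(sil, 4, 2, 0.2)   # AT 4 2 0.2 {sil.transP}
for row in sil:
    print(" ".join(f"{p:.2f}" for p in row))
```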



Fig. 3.10 Step 7

Finally, another two passes of HERest are applied using the phone transcriptions with sp
models between words. This leaves the set of monophone HMMs created so far in the directory
hmm7. This step is illustrated in Fig. 3.10.

3.2.3 Step 8 - Realigning the Training Data

As noted earlier, the dictionary contains multiple pronunciations for some words, particularly
function words. The phone models created so far can be used to realign the training data and create
new transcriptions. This can be done with a single invocation of the HTK recognition tool HVite,

HVite -l '*' -o SWT -b silence -C config -a -H hmm7/macros \
    -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab \
    -I words.mlf -S train.scp dict monophones1



This command uses the HMMs stored in hmm7 to transform the input word level transcription
words.mlf to the new phone level transcription aligned.mlf using the pronunciations stored in
the dictionary dict (see Fig 3.11). The key difference between this operation and the original word-
to-phone mapping performed by HLEd in step 4 is that the recogniser considers all pronunciations
for each word and outputs the pronunciation that best matches the acoustic data.

In the above, the -b option is used to insert a silence model at the start and end of each
utterance. The name silence is used on the assumption that the dictionary contains an entry

silence sil

Note that the dictionary should be sorted firstly by case (upper case first) and secondly
alphabetically. The -t option sets a pruning level of 250.0 and the -o option is used to suppress
the printing of scores, word names and time boundaries in the output MLF.

Fig. 3.11 Step 8



Once the new phone alignments have been created, another 2 passes of HERest can be applied
to reestimate the HMM set parameters again. Assuming that this is done, the final monophone
HMM set will be stored in directory hmm9.

3.3 Creating Tied-State Triphones

Given a set of monophone HMMs, the final stage of model building is to create context-dependent
triphone HMMs. This is done in two steps. Firstly, the monophone transcriptions are converted to
triphone transcriptions and a set of triphone models are created by copying the monophones and
re-estimating. Secondly, similar acoustic states of these triphones are tied to ensure that all state
distributions can be robustly estimated.

3.3.1 Step 9 - Making Triphones from Monophones

Context-dependent triphones can be made by simply cloning monophones and then re-estimating
using triphone transcriptions. The latter should be created first using HLEd because a side-effect
is to generate a list of all the triphones for which there is at least one example in the training data.
That is, executing

HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf

will convert the monophone transcriptions in aligned.mlf to an equivalent set of triphone tran-
scriptions in wintri.mlf. At the same time, a list of triphones is written to the file triphones1.
The edit script mktri.led contains the commands



WB sp
WB sil
TC

The two WB commands define sp and sil as word boundary symbols. These then block the addition
of context in the TC command, seen in the script above, which converts all phones (except word
boundary symbols) to triphones. For example,



sil th ih s sp m ae n sp 



becomes 



sil th+ih th-ih+s ih-s sp m+ae m-ae+n ae-n sp ... 

This style of triphone transcription is referred to as word internal. Note that some biphones will 
also be generated as contexts at word boundaries will sometimes only include two phones. 
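
The word internal conversion can be expressed compactly. The sketch below reproduces the example
above; it illustrates the rule rather than HLEd's implementation, and the function name is
hypothetical.

```python
def to_word_internal_triphones(phones, boundaries=("sp", "sil")):
    """Add left and right context to each phone, never across a word boundary."""
    out = []
    for idx, phone in enumerate(phones):
        if phone in boundaries:
            out.append(phone)              # boundary symbols are left unchanged
            continue
        left = phones[idx - 1] if idx > 0 else None
        right = phones[idx + 1] if idx + 1 < len(phones) else None
        name = phone
        if left is not None and left not in boundaries:
            name = f"{left}-{name}"        # left context
        if right is not None and right not in boundaries:
            name = f"{name}+{right}"       # right context
        out.append(name)
    return out


result = to_word_internal_triphones("sil th ih s sp m ae n sp".split())
print(" ".join(result))
# The distinct context-dependent names correspond to the list written with -n.
print(sorted(set(result) - {"sp", "sil"}))
```

Phones at the start or end of a word receive context on one side only, which is where the
biphones come from.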
The cloning of models can be done efficiently using the HMM editor HHEd: 



HHEd -B -H hmm9/macros -H hmm9/hmmdefs -M hmm10 \
    mktri.hed monophones1

where the edit script mktri.hed contains a clone command CL followed by TI commands to tie all
of the transition matrices in each triphone set, that is:

CL triphones1
TI T_ah {(*-ah+*,ah+*,*-ah).transP}
TI T_ax {(*-ax+*,ax+*,*-ax).transP}
TI T_ey {(*-ey+*,ey+*,*-ey).transP}
TI T_b {(*-b+*,b+*,*-b).transP}
TI T_ay {(*-ay+*,ay+*,*-ay).transP}



The file mktri.hed can be generated using the Perl script maketrihed included in the HTKTutorial
directory. When running the HHEd command you will get warnings about trying to tie transition
matrices for the sil and sp models. Since neither model is context-dependent there aren't actually
any matrices to tie.

The clone command CL takes as its argument the name of the file containing the list of triphones
(and biphones) generated above. For each model of the form a-b+c in this list, it looks for the
monophone b and makes a copy of it. Each TI command takes as its argument the name of a macro
and a list of HMM components. The latter uses a notation which attempts to mimic the hierarchical
structure of the HMM parameter set in which the transition matrix transP can be regarded as a
sub-component of each HMM. The list of items within brackets are patterns designed to match the
set of triphones, right biphones and left biphones for each phone.
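
Because the edit script is so regular, it can be generated from the monophone list. The sketch
below is a Python illustration of that idea and is not the maketrihed Perl script; the function
name make_mktri_hed is hypothetical.

```python
def make_mktri_hed(monophones, triphone_list="triphones1"):
    """Return edit-script text: one CL command, then one TI command per phone."""
    lines = [f"CL {triphone_list}"]
    for phone in monophones:
        # Patterns for the triphones, right biphones and left biphones of this phone.
        lines.append(
            f"TI T_{phone} {{(*-{phone}+*,{phone}+*,*-{phone}).transP}}"
        )
    return "\n".join(lines) + "\n"


print(make_mktri_hed(["ah", "ax", "ey", "b", "ay"]), end="")
```

If sil and sp are included in the phone list, TI lines are produced for them as well, which is
consistent with the warnings mentioned above.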



~h "t-ah+p"
<transP>
0.0 1.0 0.0 ..
0.0 0.4 0.6 ..

~h "t-ah+b"
<transP>
0.0 1.0 0.0 ..
0.0 0.4 0.6 ..

~t "T_ah"
<transP>
0.0 1.0 0.0 ..
0.0 0.4 0.6 ..

~h "t-ah+p"

Fig. 3.12 Tying Transition Matrices



Up to now macros and tying have only been mentioned in passing. Although a full explanation
must wait until chapter 7, a brief explanation is warranted here. Tying means that one or more
HMMs share the same set of parameters. On the left side of Fig. 3.12, two HMM definitions are
shown. Each HMM has its own individual transition matrix. On the right side, the effect of the
first TI command in the edit script mktri.hed is shown. The individual transition matrices have
been replaced by a reference to a macro called T_ah which contains a matrix shared by both models.
When reestimating tied parameters, the data which would have been used for each of the original
untied parameters is pooled so that a much more reliable estimate can be obtained.

Of course, tying could affect performance if performed indiscriminately. Hence, it is important 
to only tie parameters which have little effect on discrimination. This is the case here where the 
transition parameters do not vary significantly with acoustic context but nevertheless need to be 
estimated accurately. Some triphones will occur only once or twice and so very poor estimates 
would be obtained if tying was not done. These problems of data insufficiency will affect the output 
distributions too, but this will be dealt with in the next step.

Hitherto, all HMMs have been stored in text format and could be inspected like any text file.
Now however, the model files will be getting larger and space and load/store times become an issue.
For increased efficiency, HTK can store and load MMFs in binary format. Setting the standard -B
option causes this to happen.

Fig. 3.13 Step 9

Once the context-dependent models have been cloned, the new triphone set can be re-estimated
using HERest. This is done as previously except that the monophone model list is replaced by a
triphone list and the triphone transcriptions are used in place of the monophone transcriptions.

For the final pass of HERest, the -s option should be used to generate a file of state occupation
statistics called stats. In combination with the means and variances, these enable likelihoods to be
calculated for clusters of states and are needed during the state-clustering process described below.
Fig. 3.13 illustrates this step of the HMM construction procedure. Re-estimation should be again
done twice, so that the resultant model sets will ultimately be saved in hmm12.

HERest -B -C config -I wintri.mlf -t 250.0 150.0 1000.0 -s stats \
    -S train.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1

3.3.2 Step 10 - Making Tied-State Triphones

The outcome of the previous stage is a set of triphone HMMs with all triphones in a phone set
sharing the same transition matrix. When estimating these models, many of the variances in the
output distributions will have been floored since there will be insufficient data associated with
many of the states. The last step in the model building process is to tie states within triphone sets
in order to share data and thus be able to make robust parameter estimates.

In the previous step, the TI command was used to explicitly tie all members of a set of transition
matrices together. However, the choice of which states to tie requires a bit more subtlety since the
performance of the recogniser depends crucially on how accurately the state output distributions
capture the statistics of the speech data.

HHEd provides two mechanisms which allow states to be clustered and then each cluster tied. 
The first is data-driven and uses a similarity measure between states. The second uses decision trees 
and is based on asking questions about the left and right contexts of each triphone. The decision 
tree attempts to find those contexts which make the largest difference to the acoustics and which 
should therefore distinguish clusters. 

Decision tree state tying is performed by running HHEd in the normal way, i.e.

HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 \
tree.hed triphones1 > log

Notice that the output is saved in a log file. This is important since some tuning of thresholds is 
usually needed. 

The edit script tree.hed, which contains the instructions regarding which contexts to examine 
for possible clustering, can be rather long and complex. A script for automatically generating this 
file, mkclscript, is found in the RM Demo. A version of the tree.hed script, which can be used 
with this tutorial, is included in the HTKTutorial directory. Note that this script is only capable
of creating the TB commands (decision tree clustering of states). The questions (QS) still need
defining by the user. There is, however, an example list of questions which may be suitable for some
tasks (or at least useful as an example) supplied with the RM demo (lib/quests.hed). The entire
script appropriate for clustering English phone models is too long to show here in the text, however,
its main components are given by the following fragments:

RO 100.0 stats

TR 0

QS "L_Class-Stop" {p-*,b-*,t-*,d-*,k-*,g-*}
QS "R_Class-Stop" {*+p,*+b,*+t,*+d,*+k,*+g}
QS "L_Nasal" {m-*,n-*,ng-*}
QS "R_Nasal" {*+m,*+n,*+ng}
QS "L_Glide" {y-*,w-*}
QS "R_Glide" {*+y,*+w}

QS "L_w" {w-*}
QS "R_w" {*+w}
QS "L_y" {y-*}
QS "R_y" {*+y}
QS "L_z" {z-*}
QS "R_z" {*+z}

TR 2

TB 350.0 "aa_s2" {(aa, *-aa, *-aa+*, aa+*).state[2]}
TB 350.0 "ae_s2" {(ae, *-ae, *-ae+*, ae+*).state[2]}
TB 350.0 "ah_s2" {(ah, *-ah, *-ah+*, ah+*).state[2]}
TB 350.0 "uh_s2" {(uh, *-uh, *-uh+*, uh+*).state[2]}

TB 350.0 "y_s4" {(y, *-y, *-y+*, y+*).state[4]}
TB 350.0 "z_s4" {(z, *-z, *-z+*, z+*).state[4]}
TB 350.0 "zh_s4" {(zh, *-zh, *-zh+*, zh+*).state[4]}

TR 1

AU "fulllist"
CO "tiedlist"

ST "trees"

Firstly, the RO command is used to set the outlier threshold to 100.0 and load the statistics file 
generated at the end of the previous step. The outlier threshold determines the minimum occupancy 
of any cluster and prevents a single outlier state forming a singleton cluster just because it is 
acoustically very different to all the other states. The TR command sets the trace level to zero 
in preparation for loading in the questions. Each QS command loads a single question and each 
question is defined by a set of contexts. For example, the first QS command defines a question called 
L_Class-Stop which is true if the left context is any of the stops p, b, t, d, k or g.
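
A minimal Python sketch of how such a question can be evaluated, assuming that the context patterns
behave like shell-style wildcards over the triphone name. The helper question_is_true and the sample
model names are illustrative only and are not part of HTK.

```python
from fnmatch import fnmatchcase

# Context patterns for the L_Class-Stop and R_Nasal questions.
QUESTIONS = {
    "L_Class-Stop": ["p-*", "b-*", "t-*", "d-*", "k-*", "g-*"],
    "R_Nasal": ["*+m", "*+n", "*+ng"],
}


def question_is_true(question, model_name):
    """Return True if any pattern of the question matches the model name."""
    return any(fnmatchcase(model_name, pattern) for pattern in QUESTIONS[question])


print(question_is_true("L_Class-Stop", "t-aa+n"))  # left context t is a stop
print(question_is_true("R_Nasal", "t-aa+n"))       # right context n is a nasal
print(question_is_true("L_Class-Stop", "s-aa+n"))  # left context s is not a stop
```

A question therefore partitions any pool of triphone states into those whose names match one of its
patterns and those whose names do not.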
Fig. 3.14 Step 10

Notice that for a triphone system, it is necessary to include questions referring to both the right
and left contexts of a phone. The questions should progress from wide, general classifications (such
as consonant, vowel, nasal, diphthong, etc.) to specific instances of each phone. Ideally, the full set
of questions loaded using the QS command would include every possible context which can influence
the acoustic realisation of a phone, and can include any linguistic or phonetic classification which
may be relevant. There is no harm in creating extra unnecessary questions, because those which
are determined to be irrelevant to the data will be ignored.

The second TR command enables intermediate level progress reporting so that each of the following
TB commands can be monitored. Each of these TB commands clusters one specific set of
states. For example, the first TB command applies to the first emitting state of all context-dependent
models for the phone aa.

Each TB command works as follows. Firstly, each set of states defined by the final argument is
pooled to form a single cluster. Each question in the question set loaded by the QS commands is
used to split the pool into two sets. The use of two sets rather than one allows the log likelihood of
the training data to be increased and the question which maximises this increase is selected for the
first branch of the tree. The process is then repeated until the increase in log likelihood achievable
by any question at any node is less than the threshold specified by the first argument (350.0 in this
case).
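
The selection of a single split can be sketched in Python. This is a simplified illustration and not
HTK's implementation: it assumes each state is summarised by a one-dimensional Gaussian with an
occupation count, it treats question patterns as shell-style wildcards, and the state statistics, the
function names and the min_occupancy handling are all hypothetical.

```python
import math
from fnmatch import fnmatchcase


def cluster_log_likelihood(states):
    """Log likelihood of modelling the pooled states with one 1-D Gaussian.

    Each state is a tuple (occupancy, mean, variance).
    """
    occ = sum(o for o, _, _ in states)
    mean = sum(o * m for o, m, _ in states) / occ
    second_moment = sum(o * (v + m * m) for o, m, v in states) / occ
    var = second_moment - mean * mean
    return -0.5 * occ * (math.log(2.0 * math.pi * var) + 1.0)


def occupancy(states):
    """Total occupation count of a list of states."""
    return sum(o for o, _, _ in states)


def best_question(pool, questions, min_occupancy):
    """Return (name, gain) for the question giving the largest likelihood gain."""
    base = cluster_log_likelihood(list(pool.values()))
    best_name, best_gain = None, 0.0
    for name, patterns in questions.items():
        yes, no = [], []
        for model, stats in pool.items():
            hit = any(fnmatchcase(model, p) for p in patterns)
            (yes if hit else no).append(stats)
        if not yes or not no:
            continue  # the question does not split this pool
        if min(occupancy(yes), occupancy(no)) < min_occupancy:
            continue  # one side would fall below the minimum occupancy
        gain = cluster_log_likelihood(yes) + cluster_log_likelihood(no) - base
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Hypothetical state statistics: model name -> (occupancy, mean, variance).
pool = {
    "m-aa+t": (200.0, 0.0, 1.0),
    "n-aa+t": (150.0, 0.2, 1.0),
    "p-aa+t": (180.0, 5.0, 1.0),
    "k-aa+t": (170.0, 5.3, 1.0),
}
questions = {"L_Nasal": ["m-*", "n-*", "ng-*"], "L_m": ["m-*"]}

name, gain = best_question(pool, questions, min_occupancy=100.0)
print(name, gain > 350.0)
```

In this toy pool the nasal question separates the two groups of similar states, so it gives a larger
gain than the narrower question about m alone. A full tree would apply the same selection
recursively to each resulting set until no gain exceeds the TB threshold.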

Note that the values given in the RO and TB commands affect the degree of tying and therefore
the number of states output in the clustered system. The values should be varied according to the
amount of training data available. As a final step to the clustering, any pair of clusters which can
be merged such that the decrease in log likelihood is below the threshold is merged. On completion,
the states in each cluster i are tied to form a single shared state with macro name xxx_i where xxx
is the name given by the second argument of the TB command.

The set of triphones used so far only includes those needed to cover the training data. The AU
command takes as its argument a new list of triphones expanded to include all those needed for
recognition. This list can be generated, for example, by using HDMan on the entire dictionary
(not just the training dictionary), converting it to triphones using the command TC and outputting
a list of the distinct triphones to a file using the option -n

HDMan -b sp -n fulllist -g global.ded -l flog beep-tri beep

The -b sp option specifies that the sp phone is used as a word boundary, and so is excluded from 
triphones. The effect of the AU command is to use the decision trees to synthesise all of the new 
previously unseen triphones in the new list. 
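
The conversion of a pronunciation into word-internal triphone names can be sketched in Python. This
is an illustration of the naming convention only (left-phone+right for the first phone, left-phone for
the last), not a reimplementation of HDMan; the function name and the example pronunciation are
hypothetical.

```python
def word_internal_triphones(phones, boundary="sp"):
    """Expand one word's phone sequence into word-internal triphone names."""
    core = [p for p in phones if p != boundary]
    names = []
    idx = 0  # position of the current phone within the boundary-free sequence
    for phone in phones:
        if phone == boundary:
            names.append(phone)  # the boundary phone stays context-free
            continue
        left = core[idx - 1] if idx > 0 else None
        right = core[idx + 1] if idx + 1 < len(core) else None
        name = phone
        if left is not None:
            name = left + "-" + name
        if right is not None:
            name = name + "+" + right
        names.append(name)
        idx += 1
    return names


print(word_internal_triphones(["d", "ay", "ax", "l", "sp"]))
```

Collecting the distinct names produced for every pronunciation in the dictionary gives the kind of
list that the AU command expects.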
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Once all state-tying has been completed and new models synthesised, some models may share 
exactly the same 3 states and transition matrices and are thus identical. The CO command is used
to compact the model set by finding all identical models and tying them together, producing a
new list of models called tiedlist. 

One of the advantages of using decision tree clustering is that it allows previously unseen
triphones to be synthesised. To do this, the trees must be saved and this is done by the ST command.
Later if new previously unseen triphones are required, for example in the pronunciation of a new 
vocabulary item, the existing model set can be reloaded into HHEd, the trees reloaded using the 
LT command and then a new extended list of triphones created using the AU command. 

After HHEd has completed, the effect of tying can be studied and the thresholds adjusted if
necessary. The log file will include summary statistics which give the total number of physical states
remaining and the number of models after compacting.

Finally, and for the last time, the models are re-estimated twice using HERest. Fig. 3.14
illustrates this last step in the HMM build process. The trained models are then contained in the
file hmm15/hmmdefs.

3.4 Recogniser Evaluation



The recogniser is now complete and its performance can be evaluated. The recognition network
and dictionary have already been constructed, and test data has been recorded. Thus, all that
is necessary is to run the recogniser and then evaluate the results using the HTK analysis tool
HResults.

3.4.1 Step 11 - Recognising the Test Data

Assuming that test.scp holds a list of the coded test files, then each test file will be recognised
and its transcription output to an MLF called recout.mlf by executing the following

HVite -H hmm15/macros -H hmm15/hmmdefs -S test.scp \
-l '*' -i recout.mlf -w wdnet \
-p 0.0 -s 5.0 dict tiedlist

The options -p and -s set the word insertion penalty and the grammar scale factor, respectively.
The word insertion penalty is a fixed value added to each token when it transits from the end of
one word to the start of the next. The grammar scale factor is the amount by which the language
model probability is scaled before being added to each token as it transits from the end of one word
to the start of the next. These parameters can have a significant effect on recognition performance
and hence, some tuning on development test data is well worthwhile.
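
The effect of the two parameters on a token can be sketched in Python. This is a simplified reading
of the description above, assuming that token scores and language model probabilities are held as
log values; the function name and the numbers in the example are hypothetical.

```python
def score_after_word_boundary(token_score, lm_log_prob, scale=5.0, penalty=0.0):
    """Token log score after moving from the end of one word to the next.

    scale plays the role of the grammar scale factor (-s) and penalty the
    role of the word insertion penalty (-p).
    """
    return token_score + scale * lm_log_prob + penalty


# With -p 0.0 and -s 5.0 a language model log probability of -2.5
# lowers the token score by 12.5.
print(score_after_word_boundary(-1200.0, -2.5))
# A negative penalty makes every additional word more expensive.
print(score_after_word_boundary(-1200.0, -2.5, penalty=-10.0))
```

Raising the scale factor gives the language model more influence relative to the acoustic scores,
while a more negative penalty discourages hypotheses that contain many short words.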

The dictionary contains monophone transcriptions whereas the supplied HMM list contains
word-internal triphones. HVite will make the necessary conversions when loading the word network
wdnet. However, if the HMM list contained both monophones and context-dependent phones then
HVite would become confused. The required form of word-internal network expansion can be
forced by setting the configuration variable FORCECXTEXP to true and ALLOWXWRDEXP to false (see
chapter 12 for details).

Assuming that the MLF testref.mlf contains word level transcriptions for each test file, the
actual performance can be determined by running HResults as follows

HResults -I testref.mlf tiedlist recout.mlf

The result would be a print-out of the form

====================== HTK Results Analysis
Date: Sun Oct 22 16:14:45 1995
Ref : testrefs.mlf
Rec : recout.mlf
Overall Results
SENT: %Correct=98.50 [H=197, S=3, N=200]
WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]

Footnote on the CO command (Step 10): if the transition matrices had not been tied, the CO command
would be ineffective since all models would be different by virtue of their unique transition matrices.

Footnote on testref.mlf: the HLEd tool may have to be used to insert silences at the start and end of
each transcription or alternatively HResults can be used to ignore silences (or any other symbols)
using the -e option.

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The line starting with SENT: indicates that of the 200 test utterances, 197 (98.50%) were correctly 
recognised. The following line starting with WORD: gives the word level statistics and indicates that 
of the 855 words in total, 853 (99.77%) were recognised correctly. There was 1 deletion error (D), 
1 substitution error (S) and 1 insertion error (I). The accuracy figure (Acc) of 99.65% is lower 
than the percentage correct (Corr) because it takes account of the insertion errors which the latter
ignores.
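
The figures in the print-out can be reproduced with a few lines of Python, assuming the usual
definitions in which the number of reference labels is N = H + D + S, the percentage correct is
H/N and the accuracy is (H - I)/N. The function name is illustrative.

```python
def hresults_scores(hits, deletions, substitutions, insertions):
    """Return (N, percent correct, accuracy) from the error counts."""
    total = hits + deletions + substitutions
    percent_correct = 100.0 * hits / total
    accuracy = 100.0 * (hits - insertions) / total
    return total, percent_correct, accuracy


total, corr, acc = hresults_scores(hits=853, deletions=1, substitutions=1, insertions=1)
print(total, round(corr, 2), round(acc, 2))  # 855 99.77 99.65

# Sentence level: 197 of the 200 utterances were entirely correct.
print(100.0 * 197 / 200)  # 98.5
```

Because insertions are subtracted only in the accuracy figure, a recogniser that inserts many spurious
words can show a high percentage correct together with a much lower accuracy.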
Fig. 3.15 Step 11

3.5 Running the Recogniser Live

The recogniser can also be run with live input. To do this it is only necessary to set the configuration
variables needed to convert the input audio to the correct form of parameterisation. Specifically,
the following needs to be appended to the configuration file config to create a new configuration
file config2



# Waveform capture
SOURCERATE=625.0
SOURCEKIND=HAUDIO
SOURCEFORMAT=HTK
ENORMALISE=F
USESILDET=T
MEASURESIL=F
OUTSILWARN=T



These indicate that the source is direct audio with sample period 62.5 μsecs. The silence detector
is enabled and a measurement of the background speech/silence levels should be made at start-up.
The final line makes sure that a warning is printed when this silence measurement is being made.
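
The relationship between SOURCERATE and the sampling frequency can be checked with a short Python
calculation, given that HTK expresses time values in units of 100 ns. The function names are
illustrative.

```python
def sample_period_microseconds(source_rate):
    """Convert a SOURCERATE value (units of 100 ns) to microseconds."""
    return source_rate / 10.0


def sample_frequency_hz(source_rate):
    """Convert a SOURCERATE value (units of 100 ns) to a frequency in Hz."""
    return 1e7 / source_rate


print(sample_period_microseconds(625.0))  # 62.5
print(sample_frequency_hz(625.0))         # 16000.0
```

A SOURCERATE of 625.0 therefore corresponds to audio sampled at 16 kHz.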

Once the configuration file has been set up for direct audio input, HVite can be run as in the
previous step except that no files need be given as arguments



HVite -H hmm15/macros -H hmm15/hmmdefs -C config2 \
-w wdnet -p 0.0 -s 5.0 dict tiedlist



On start-up, HVite will prompt the user to speak an arbitrary sentence (approx. 4 secs) in
order to measure the speech and background silence levels. It will then repeatedly recognise and, if
trace level bit 1 is set, it will output each utterance to the terminal. A typical session is as follows

Read 1648 physical / 4131 logical HMMs 
Read lattice with 26 nodes / 52 arcs 
Created network with 123 nodes / 151 links 



READY[1]>
Please speak sentence - measuring levels
Level measurement completed
DIAL FOUR SIX FOUR TWO FOUR OH
== [303 frames] -95.5773 [Ac=-28630.2 LM=-329.8] (Act=21.8)

READY[2]>
DIAL ZERO EIGHT SIX TWO
== [228 frames] -99.3758 [Ac=-22402.2 LM=-255.5] (Act=21.8)

READY[3]>
etc

During loading, information will be printed out regarding the different recogniser components. The
physical models are the distinct HMMs used by the system, while the logical models include all
model names. The number of logical models is higher than the number of physical models because
many logically distinct models have been determined to be physically identical and have been
merged during the previous model building steps. The lattice information refers to the number of
links and nodes in the recognition syntax. The network information refers to the actual recognition
network built by expanding the lattice using the current HMM set, dictionary and any context
expansion rules specified. After each utterance, the numerical information gives the total number
of frames, the average log likelihood per frame, the total acoustic score, the total language model
score and the average number of models active.
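
The numbers printed in the sample session appear to be consistent with the average log likelihood
per frame being the sum of the acoustic and language model scores divided by the number of frames.
The Python check below is an observation about the printed figures only; the function name is
illustrative and the small differences come from the scores being printed to one decimal place.

```python
def average_log_likelihood(acoustic_score, lm_score, frames):
    """Average log likelihood per frame from the two total scores."""
    return (acoustic_score + lm_score) / frames


first = average_log_likelihood(-28630.2, -329.8, 303)
second = average_log_likelihood(-22402.2, -255.5, 228)

# Compare with the printed values -95.5773 and -99.3758.
print(abs(first - (-95.5773)) < 1e-3)
print(abs(second - (-99.3758)) < 1e-3)
```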

Note that if it was required to recognise a new name, then the following two changes would be
needed

1. the grammar would be altered to include the new name

2. a pronunciation for the new name would be added to the dictionary

If the new name required triphones which did not exist, then they could be created by loading the
existing triphone set into HHEd, loading the decision trees using the LT command and then using
the AU command to generate a new complete triphone set.

3.6 Adapting the HMMs

The previous sections have described the stages required to build a simple voice dialling system.
To simplify this process, speaker dependent models were developed using training data from a
single user. Consequently, recognition accuracy for any other users would be poor. To overcome
this limitation, a set of speaker independent models could be constructed, but this would require
large amounts of training data from a variety of speakers. An alternative is to adapt the current
speaker dependent models to the characteristics of a new speaker using a small amount of training or
adaptation data. In general, adaptation techniques are applied to well trained speaker independent
model sets to enable them to better model the characteristics of particular speakers.

HTK supports both supervised adaptation, where the true transcription of the data is known, and
unsupervised adaptation, where the transcription is hypothesised. In HTK supervised adaptation
is performed offline by HERest using maximum likelihood linear transformations (for example
MLLR, CMLLR) and/or maximum a-posteriori (MAP) techniques to estimate a series of transforms
or a transformed model set that reduces the mismatch between the current model set and the
adaptation data. Unsupervised adaptation is provided by HVite, using just linear transformations.

The following sections describe offline supervised adaptation (using MLLR) with the use of
HERest.



3.6.1 Step 12 - Preparation of the Adaptation Data 



As in normal recogniser development, the first stage in adaptation involves data preparation. Speech 
data from the new user is required for both adapting the models and testing the adapted system. 
The data can be obtained in a similar fashion to that taken to prepare the original test data. Initially, 
prompt lists for the adaptation and test data will be generated using HSGen. For example, typing 



HSGen -l -n 20 wdnet dict > promptsAdapt
HSGen -l -n 20 wdnet dict > promptsTest
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would produce two prompt files for the adaptation and test data. The amount of adaptation data 
required will normally be found empirically, but a performance improvement should be observable 
after just 30 seconds of speech. In this case, around 20 utterances should be sufficient. HSLab can 
be used to record the associated speech. 

Assuming that the script files codeAdapt.scp and codeTest.scp list the source and output files
for the adaptation and test data respectively, then both sets of speech can be coded using the
HCopy commands given below.

HCopy -C config -S codeAdapt.scp
HCopy -C config -S codeTest.scp



The final stage of preparation involves generating context dependent phone transcriptions of the
adaptation data and word level transcriptions of the test data for use in adapting the models
and evaluating their performance. The transcriptions of the test data can be obtained using
prompts2mlf. To minimize the problem of multiple pronunciations the phone level transcriptions
of the adaptation data can be obtained by using HVite to perform a forced alignment of the
adaptation data. Assuming that the word level transcriptions are listed in adaptWords.mlf, then the
following command will place the phone transcriptions in adaptPhones.mlf.



HVite -l '*' -o SWT -b silence -C config -a -H hmm15/macros \
-H hmm15/hmmdefs -i adaptPhones.mlf -m -t 250.0 \
-I adaptWords.mlf -y lab -S adapt.scp dict tiedlist

3.6.2 Step 13 - Generating the Transforms

HERest provides support for a range of linear transformations and possible number of
transformations. Regression class trees can be used to dynamically specify the number of transformations
to be generated, or the number may be pre-determined using a set of baseclasses. The HTK tool
HHEd can be used to build a regression class tree and store it along with a set of baseclasses. For
example,

HHEd -B -H hmm15/macros -H hmm15/hmmdefs -M classes regtree.hed tiedlist

creates a regression class tree using the models stored in hmm15 and stores the regression class
tree and base classes in the classes directory. The HHEd edit script regtree.hed contains the
following commands

LS "hmm15/stats"
RC 32 "rtree"

The RN command assigns an identifier to the HMM set. The LS command loads the state occupation
statistics file stats generated by the last application of HERest which created the models in hmm15.
The RC command then attempts to build a regression class tree with 32 terminal or leaf nodes using
these statistics. In addition a global transform is used as the default. The baseclass for this must
still be specified, for example in the file "global"

~b "global"
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-12]}



This file should be added to the classes directory.

HERest and HVite can be used to perform static adaptation, where all the adaptation data
is processed in a single block. Note that, as with standard HMM training, HERest will expect the
list of model names. In contrast HVite only needs the list of words. HVite can also be used for
incremental adaptation. In this tutorial the use of static adaptation with HERest will be described
with MLLR as the form of linear adaptation.
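
As a reminder of what an MLLR mean transform does, the following Python sketch applies a linear
transform with a bias to a Gaussian mean vector, using the conventional form in which the adapted
mean is A times the original mean plus b. The matrix, bias and mean values are made up for
illustration and the function name is hypothetical.

```python
def adapt_mean(matrix, bias, mean):
    """Return the adapted mean A * mean + b using plain Python lists."""
    return [
        sum(a * m for a, m in zip(row, mean)) + b
        for row, b in zip(matrix, bias)
    ]


A = [[1.0, 0.5],
     [0.0, 2.0]]
b = [1.0, -1.0]
mu = [2.0, 4.0]

print(adapt_mean(A, b, mu))  # [5.0, 7.0]
```

With a global transform one such matrix and bias are shared by every Gaussian in the model set,
whereas a regression class tree allows different groups of Gaussians to receive different transforms.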

The example use of HERest for adaptation involves two passes. On the first pass a global
adaptation is performed. The second pass then uses the global transformation as an input
transformation, to transform the model set, producing better frame/state alignments which are then
used to estimate a set of more specific transforms, using a regression class tree. After estimating
the transforms, HERest can output either the newly adapted model set or, in the default
setting, the transformations themselves in either a transform model file (TMF) or as a set of distinct
transformations. The latter forms can be advantageous if storage is an issue since the TMFs (or
transforms) are significantly smaller than MMFs and the computational overhead incurred when
transforming a model set using a transform is negligible.

The two applications of HERest below demonstrate a static two-pass adaptation approach
where the global and regression class transformations are stored in the directory xforms with file
extensions mllr1 for the global transform and mllr2 for the multiple regression class system.

HERest -C config -C config.global -S adapt.scp -I adaptPhones.mlf \
-H hmm15/macros -u a -H hmm15/hmmdefs -z -K xforms mllr1 -J classes \
-h '*/%%%%%%_*.mfc' tiedlist

HERest -a -C config -C config.rc -S adapt.scp -I adaptPhones.mlf \
-H hmm15/macros -u a -H hmm15/hmmdefs -J xforms mllr1 -K xforms mllr2 \
-J classes -h '*/%%%%%%_*.mfc' tiedlist

where config.global has the form

HADAPT:TRANSKIND = MLLRMEAN
HADAPT:USEBIAS = TRUE
HADAPT:BASECLASS = global
HADAPT:ADAPTKIND = BASE
HADAPT:KEEPXFORMDISTINCT = TRUE

HADAPT:TRACE = 61
HMODEL:TRACE = 512

config.rc has the form

HADAPT:TRANSKIND = MLLRMEAN
HADAPT:USEBIAS = TRUE
HADAPT:REGTREE = rtree.tree
HADAPT:ADAPTKIND = TREE
HADAPT:SPLITTHRESH = 1000.0
HADAPT:KEEPXFORMDISTINCT = TRUE

HADAPT:TRACE = 61
HMODEL:TRACE = 512

The last two entries yield useful log information to do with which transforms are being used and
from where. -h is a mask that is used to detect when the speaker changes and is also used to
determine the name of the speaker transform. File masks may also be separately specified
using configuration variables:

INXFORMMASK
PAXFORMMASK

The output transform mask is assumed to be specified using the -h option and by default the input
and parent transforms are assumed to be the same.

One important difference between the standard HMM macros and the adaptation macros is
that for adaptation multiple directories may be specified using the -J option to search for the
appropriate macro. This is useful when using multiple parent transforms. The set of adaptation
transforms are: a, b, r, f, g, x, y, j. The -J flag (along with the -K and -E flags for output and
parent transforms respectively) takes an optional argument that specifies the input transform
file extension. For the -J flag this can only be specified the first time that a -J flag is
encountered in the command line. It is strongly recommended that this option is used as it allows
easy tracking of transforms.
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3.6.3 Step 14 - Evaluation of the Adapted System 

To evaluate the performance of the adaptation, the test data previously recorded is recognised using
HVite. Assuming that testAdapt.scp contains a list of all of the coded test files, then HVite
can be invoked in much the same way as before but with the additional -J argument used to load
the model transformation file and baseclasses.



HVite -H hmm15/macros -H hmm15/hmmdefs -S testAdapt.scp -l '*' \
-J xforms mllr2 -h '*/%%%%%%_*.mfc' -k -i recoutAdapt.mlf -w wdnet \
-J classes -C config -p 0.0 -s 5.0 dict tiedlist



The results of the adapted model set can then be observed using HResults in the usual manner.

The RM Demo contains a section on speaker adaptation (section 10). This describes the
various options available in detail, along with example configuration files that may be used.

3.7 Semi-Tied and HLDA transforms

HERest also supports estimation of a semi-tied transform. Here only a global semi-tied transform
is described, however multiple baseclasses can be used. A new configuration file, config.semi, is
required. This contains

HADAPT:TRANSKIND = SEMIT
HADAPT:USEBIAS = FALSE
HADAPT:BASECLASS = global
HADAPT:SPLITTHRESH = 0.0
HADAPT:MAXXFORMITER = 100
HADAPT:MAXSEMITIEDITER = 20

HADAPT:TRACE = 61
HMODEL:TRACE = 512



The global macro from step 13 must already have been generated. The example command below
can then be run. This generates a new model set stored in hmm16 and a semi-tied transform in
hmm16/SEMITIED.

HERest -C config -C config.semi -S train.scp -I adaptPhones.mlf \
-H hmm15/macros -u stw -H hmm15/hmmdefs -K hmm16 -M hmm16 tiedlist




An additional iteration of HERest can then be run using 

HERest -C config -S train.scp -I adaptPhones.mlf -H hmm16/macros -u tmvw \
-J hmm16 -J classes -H hmm16/hmmdefs -M hmm17 tiedlist

To evaluate the semi-tied estimated model the following command can be used 

HVite -H hmm17/macros -H hmm17/hmmdefs -S testAdapt.scp -l '*' \
-J hmm16 -J classes -i recoutAdapt.mlf -w wdnet \
-C config -p 0.0 -s 5.0 dict tiedlist



Note the -J options must be included as the semi-tied transform is stored in the same fashion as
the adaptation transforms. Thus the transform itself is stored in directory hmm16 and the global
base class in classes.

There are a number of other useful options that may be explored, for example HLDA. If
config.semi is replaced by config.hlda containing

HADAPT:TRANSKIND = SEMIT
HADAPT:USEBIAS = FALSE
HADAPT:BASECLASS = global
HADAPT:SPLITTHRESH = 0.0
HADAPT:MAXXFORMITER = 100
HADAPT:MAXSEMITIEDITER = 20
HADAPT:SEMITIED2INPUTXFORM = TRUE
HADAPT:NUMNUISANCEDIM = 5
HADAPT:SEMITIEDMACRO = HLDA

HADAPT:TRACE = 61
HMODEL:TRACE = 512



An HLDA InputXForm that reduces the number of dimensions by 5 is estimated and stored with
the model set. A copy of the transform is also stored in a file called hmm16/HLDA. For input
transforms (and global semi-tied transforms) there are two forms in which the transform can be
stored. First it may be stored as an AdaptXForm of type SEMIT. The second form is as an input
transform. The latter is preferable if the feature-vector size is modified. The form of transform is
determined by how HADAPT:SEMITIED2INPUTXFORM is set.

One of the advantages of storing a global transform as an input transform is that there is
no need to specify any -J options as the INPUTXFORM is by default stored with the model set
options. To prevent the INPUTXFORM being stored with the model set (for example to allow backward
compatibility) set the following configuration option

HMODEL:SAVEINPUTXFORM = FALSE

3.8 Summary

This chapter has described the construction of a tied-state phone-based continuous speech recogniser
and in so doing, it has touched on most of the main areas addressed by HTK: recording, data
preparation, HMM definitions, training tools, adaptation tools, networks, decoding and evaluating.
The rest of this book discusses each of these topics in detail.
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Chapter 4 
This chapter discusses the various ways of controlling the operation of HTK tools along with
related aspects of file system organisation, error reporting and memory management. All of the
operating system and user interface functions are provided by the HTK module HShell. Memory
management is a low level function which is largely invisible to the user, but it is useful to have a
basic understanding of it in order to appreciate memory requirements and interpret diagnostic
output from tools. Low level memory management in HTK is provided by HMem and the management
of higher level structures such as vectors and matrices is provided by HMath.

The behaviour of a HTK tool depends on three sources of information. Firstly, all HTK tools
are executed by issuing commands to the operating system shell. Each command typically contains
the names of the various files that the tool needs to function and a number of optional arguments
which control the detailed behaviour of the tool. Secondly, as noted in chapter 2 and shown in the
adjacent figure, every HTK tool uses a set of standard library modules to interface to the various
file types and to connect with the outside world. Many of these modules can be customised by
setting parameters in a configuration file. Thirdly, a small number of parameters are specified using
environment variables.

Terminal output mostly depends on the specific tool being used, however, there are some generic 
output functions which are provided by the library modules and which are therefore common across 
tools. These include version reporting, memory usage and error reporting. 

Finally, HTK can read and write most data sources through pipes as an alternative to direct 
input and output from data files. This allows filters to be used, and in particular, it allows many 
of the external files used by HTK to be stored directly in compressed form and then decompressed 
on-the-fly when the data is read back in. 

All of the above is discussed in more detail in the following sections. 
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4.1 The Command Line 

The general form of command line for invoking a tool is

tool [options] files . . . 

Options always consist of a dash followed by a single letter. Some options are followed by an 
argument as follows 

-i - a switch option
-t 3 - an integer valued option
-a 0.01 - a float valued option
-s hello - a string valued option

Option names consisting of a capital letter are common across all tools (see section 4.4). Integer
arguments may be given in any of the standard C formats, for example, 13, 0xD and 015 all represent
the same number. Typing the name of a tool on its own always causes a short summary of the
command line options to be printed in place of its normal operation. For example, typing

HERest

would result in the following output

USAGE: HERest [options] hmmList dataFiles...

Option                                     Default

-c f    Mixture pruning threshold          10.0
-d s    dir to find hmm definitions        current
-m N    set min examples needed per model  3
-o s    extension for new hmm files        as src
-p N    set parallel mode to N             off

The first line shows the names of the required files and the rest consists of a listing of each option,
its meaning, and its default value.
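
The C integer formats mentioned above (13, 0xD and 015) can be illustrated with a small Python
function that follows the usual C conventions: a 0x prefix means hexadecimal and a leading 0 means
octal. This sketch mimics the C rules only; it is not taken from the HTK source and the function name
is hypothetical.

```python
def parse_c_integer(text):
    """Parse a decimal, hexadecimal (0x...) or octal (0...) integer string."""
    t = text.strip().lower()
    sign = -1 if t.startswith("-") else 1
    t = t.lstrip("+-")
    if t.startswith("0x"):
        return sign * int(t[2:], 16)
    if len(t) > 1 and t.startswith("0"):
        return sign * int(t[1:], 8)
    return sign * int(t, 10)


for argument in ("13", "0xD", "015"):
    print(argument, parse_c_integer(argument))
```

All three arguments yield the value 13, so an option such as -t could be given any of them with the
same effect.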

The precise naming convention for specifying files depends on the operating system being used,
but HTK always assumes the existence of a hierarchical file system and it maintains a distinction
between directory paths and file names.

In general, a file will be located either in the current directory, some subdirectory of the current
directory or some subdirectory of the root directory. For example, in the command

HList s1 dir/s2 /users/sjy/speech/s3

file s1 must be in the current directory, s2 must be in the directory dir within the current directory
and s3 must be in the directory /users/sjy/speech.

Some tools allow directories to be specified via configuration parameters and command line
options. In all cases, the final path character (eg / in UNIX) need not (but may be) included. For
example, both of the following are acceptable and have equivalent effect

HInit -L mymodels/new/ hmmfile data*
HInit -L mymodels/new hmmfile data*

where the -L option specifies the directory in which to find the label files associateid»rith the data 
files. ^ 



Note: all of the examples in this book assume the UNIX Operating System and the C Shell, but the
principles apply to any OS which supports hierarchical files and command line arguments.

4.2 Script Files

Tools which require a potentially very long list of files (e.g. training tools) always allow the files to
be specified in a script file via the -S option instead of via the command line. This is particularly
useful when running under an OS with limited file name expansion capability. Thus, for example,
HInit may be invoked by either

    HInit hmmfile s1 s2 s3 s4 s5 ....

or

    HInit -S filelist hmmfile

where filelist holds the list of files s1, s2, etc. Each file listed in a script should be separated by
white space or a new line. Usually, files are listed on separate lines. However, when using HCopy,
which reads pairs of files as its arguments, it is normal to write each pair on a single line. Script files
should only be used for storing ellipsed file list arguments. Note that shell meta-characters should
not be used in script files and will not be interpreted by the HTK tools.

Starting with HTK 3.1 the syntax of script files has been extended. In addition to directly
specifying the name of a physical file it is possible to define aliases and to select a segment from a
file. The general syntax of an extended filename is

    logfile=physfile[s,e]

where logfile is the logical filename used by the HTK tools and will appear in mlf files and
similar, physfile is the physical name of the actual file on disk that will be accessed and s and
e are indices that can be used to select only a segment of the file. One example of a use of this
feature is the evaluation of different segmentations of the audio data. A new segmentation can be
used by creating a new script file without having to create multiple copies of the data.

A typical script file might look like

    s23-0001-A_000143_000291.plp=/data/plp/complete/s23-0001-A.plp[143,291]
    s23-0001-A_000291_000500.plp=/data/plp/complete/s23-0001-A.plp[291,500]
    s23-0001-A_000500_000889.plp=/data/plp/complete/s23-0001-A.plp[500,889]
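The extended filename syntax can be illustrated with a short parsing sketch. This is not HTK code; the function and type names are hypothetical, and the treatment of an entry without an alias (logical name equal to the physical name) is an assumption made for the example.

```python
import re
from typing import NamedTuple, Optional


class ScriptEntry(NamedTuple):
    logical: str            # name the tools would use (e.g. in MLFs)
    physical: str           # name of the file actually opened on disk
    start: Optional[int]    # segment start index s, if given
    end: Optional[int]      # segment end index e, if given


# physfile[s,e]: a trailing bracketed pair of non-negative integers
_SEGMENT = re.compile(r"^(?P<phys>.*)\[(?P<s>\d+),(?P<e>\d+)\]$")


def parse_script_entry(entry: str) -> ScriptEntry:
    """Split one script-file entry of the form logfile=physfile[s,e]."""
    logical, sep, rest = entry.strip().partition("=")
    if not sep:
        # Assumption: a plain file name is both logical and physical name.
        return ScriptEntry(logical, logical, None, None)
    match = _SEGMENT.match(rest)
    if match:
        return ScriptEntry(logical, match.group("phys"),
                           int(match.group("s")), int(match.group("e")))
    return ScriptEntry(logical, rest, None, None)


entry = parse_script_entry(
    "s23-0001-A_000143_000291.plp=/data/plp/complete/s23-0001-A.plp[143,291]"
)
```

Here entry.physical names the single file on disk, while entry.start and entry.end select the segment, so several logical files can share one physical file.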

4.3 Configuration Files 

Configuration files are used for customising the HTK working environment. They consist of a list
of parameter-value pairs along with an optional prefix which limits the scope of the parameter to
a specific module or tool.

The name of a configuration file can be specified explicitly on the command line using the -C
option. For example, when executing

    HERest ... -C myconfig s1 s2 s3 s4 ...

the operation of HERest will depend on the parameter settings in the file myconfig.

When an explicit configuration file is specified, only those parameters mentioned in that file are
actually changed and all other parameters retain their default values. These defaults are built-in.
However, user-defined defaults can be set by assigning the name of a default configuration file to
the environment variable HCONFIG. Thus, for example, using the UNIX C Shell, writing

    setenv HCONFIG myconfig
    HERest ... s1 s2 s3 s4 ...





would have an identical effect to the preceding example. However, in this case, a further refinement
of the configuration values is possible since the opportunity to specify an explicit configuration file
on the command line remains. For example, in

    setenv HCONFIG myconfig
    HERest ... -C xconfig s1 s2 s3 s4 ...

the parameter values in xconfig will over-ride those in myconfig which in turn will over-ride the
built-in defaults. In practice, most HTK users will set general-purpose default configuration values
using HCONFIG and will then over-ride these as required for specific tasks using the -C command
line option. This is illustrated in Fig. 4.1 where the darkened rectangles indicate active parameter
definitions. Viewed from above, all of the remaining parameter definitions can be seen to be masked
by higher level over-rides.

[Figure: Fig. 4.1 Defining a Configuration]
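The masking behaviour described above can be sketched as a layered lookup. This is only an illustration of the precedence order, not how HTK stores its configuration; the function name and all parameter values below are hypothetical.

```python
from collections import ChainMap


def effective_config(builtin, hconfig, explicit):
    """Combine three layers; earlier maps in the ChainMap mask later ones.

    explicit : settings from the file given with -C       (highest priority)
    hconfig  : settings from the file named by HCONFIG
    builtin  : built-in defaults                          (lowest priority)
    """
    return dict(ChainMap(explicit, hconfig, builtin))


# Illustrative values only; these are not HTK's real defaults.
builtin = {"NUMCHANS": 20, "PREEMCOEF": 0.97, "ENORMALISE": True}
myconfig = {"NUMCHANS": 26, "PREEMCOEF": 0.95}   # stands in for HCONFIG's file
xconfig = {"PREEMCOEF": 0.90}                    # stands in for the -C file

settings = effective_config(builtin, myconfig, xconfig)
```

In this sketch PREEMCOEF comes from xconfig, NUMCHANS from myconfig and ENORMALISE from the built-in layer, mirroring the "viewed from above" picture.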

The configuration file itself consists of a sequence of parameter definitions of the form

    [MODULE:] PARAMETER = VALUE

One parameter definition is written per line and square brackets indicate that the module name is
optional. Parameter definitions are not case sensitive but by convention they are written in upper
case. A # character indicates that the rest of the line is a comment.

As an example, the following is a simple configuration file

    # Example config file
    TARGETKIND = MFCC
    NUMCHANS = 20
    WINDOWSIZE = 250000.0   # ie 25 msecs
    PREEMCOEF = 0.97
    ENORMALISE = T
    HSHELL: TRACE = 02      # octal
    HPARM: TRACE = 0101

The first five lines contain no module name and hence they apply globally, that is, any library
module or tool which is interested in the configuration parameter NUMCHANS will read the given
parameter value. In practice, this is not a problem with library modules since nearly all configuration
parameters have unique names. The final two lines show the same parameter name being given
different values within different modules. This is an example of a parameter which every module
responds to and hence does not have a unique name.

This example also shows each of the four possible types of value that can appear in a configuration
file: string, integer, float and Boolean. The configuration parameter TARGETKIND requires
a string value specifying the name of a speech parameter kind. Strings not starting with a letter
should be enclosed in double quotes. NUMCHANS requires an integer value specifying the number
of filter-bank channels to use in the analysis. WINDOWSIZE actually requires a floating-point value
specifying the window size in units of 100ns. However, an integer can always be given wherever a
float is required. PREEMCOEF also requires a floating-point value specifying the pre-emphasis
coefficient to be used. Finally, ENORMALISE is a Boolean parameter which determines whether or not
energy normalisation is to be performed. Its value must be T, TRUE or F, FALSE. Notice also that,
as in command line options, integer values can use the C conventions for writing in non-decimal
bases. Thus, the trace value of 0101 is equal to decimal 65. This is particularly useful in this case
because trace values are typically interpreted as bit-strings by HTK modules and tools.
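The value rules above can be made concrete with a small reader for the example file. This is a sketch under stated assumptions, not HTK's parser: the function names are hypothetical, a # inside a quoted string is not handled, and the order in which value types are tried is a choice made for the example.

```python
def parse_c_int(text):
    """Read an integer written in C style: 13, 0xD and 015 are all thirteen."""
    s = text.strip().lower()
    sign = -1 if s.startswith("-") else 1
    s = s.lstrip("+-")
    if s.startswith("0x"):
        return sign * int(s[2:], 16)
    if len(s) > 1 and s.startswith("0"):
        return sign * int(s, 8)      # leading zero means octal
    return sign * int(s, 10)


def parse_value(text):
    """Classify a value as Boolean, quoted string, integer, float or string."""
    text = text.strip()
    if text.upper() in ("T", "TRUE"):
        return True
    if text.upper() in ("F", "FALSE"):
        return False
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        return text[1:-1]
    for reader in (parse_c_int, float):
        try:
            return reader(text)
        except ValueError:
            pass
    return text


def parse_config(text):
    """Return {(module or None, PARAMETER): value} for each definition line."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        name, _, value = line.partition("=")
        module, sep, param = name.partition(":")
        if not sep:
            module, param = None, name
        key = (module.strip().upper() if module else None, param.strip().upper())
        params[key] = parse_value(value)
    return params


EXAMPLE = """
# Example config file
TARGETKIND = MFCC
NUMCHANS = 20
WINDOWSIZE = 250000.0 # ie 25 msecs
PREEMCOEF = 0.97
ENORMALISE = T
HSHELL: TRACE = 02 # octal
HPARM: TRACE = 0101
"""

config = parse_config(EXAMPLE)
```

With this reader, the HPARM trace value 0101 is read as octal and becomes decimal 65, whose set bits are 1 and 64.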

If the name of a configuration variable is mis-typed, there will be no warning and the variable
will simply be ignored. To help guard against this, the standard option -D can be used. This
displays all of the configuration variables before and after the tool runs. In the latter case, all
configuration variables which are still unread are marked by a hash character. The initial display
allows the configuration values to be checked before potentially wasting a large amount of CPU time
through incorrectly set parameters. The final display shows which configuration variables were
actually used during the execution of the tool. The form of the output is shown by the following
example

    HTK Configuration Parameters[3]
      Module/Tool     Parameter                  Value
    #                 SAVEBINARY                  TRUE
      HPARM           TARGETRATE         256000.000000
                      TARGETKIND                MFCC_0

Here three configuration parameters have been set but the hash (#) indicates that SAVEBINARY has
not been used.

4.4 Standard Options 

As noted in section 4.1, options consisting of a capital letter are common across all tools. Many
are specific to particular file types and they will be introduced as they arise. However, there are
six options that are standard across all tools. Three of these have been mentioned already. The
option -C is used to specify a configuration file name and the option -S is used to specify a script
file name, whilst the option -D is used to display configuration settings.

The two remaining standard options provided directly by HShell are -A and -V. The option -A
causes the current command line arguments to be printed. When running experiments via scripts,
it is a good idea to use this option to record in a log file the precise settings used for each tool. The
option -V causes version information for the tool and each module used by that tool to be listed.
These should always be quoted when making bug reports.

Finally, all tools implement the trace option -T. Trace values are typically bit strings and the
meaning of each bit is described in the reference section for each tool. Setting a trace option via
the command line overrides any setting for that same trace option in a configuration file. This is a
general rule: command line options always override defaults set in configuration files.

All of the standard options are listed in the final summary section of this chapter. As a general
rule, you should consider passing at least -A -D -V -T 1 to all tools, which will guarantee that
sufficient information is available in the tool output.

4.5 Error Reporting 

The HShell module provides a standard mechanism for reporting errors and warnings. A typical
error message is as follows

    HList: ERROR [+1110]
    IsWave: cannot open file speech.dat

This indicates that the tool HList is reporting an error number +1110. All errors have positive
error numbers and always result in the tool terminating. Warnings have negative error numbers
and the tool does not terminate. The first two digits of an error number indicate the module or tool
in which the error is located (HList in this case) and the last two digits define the class of error.
The second line of the error message names the actual routine in which the error occurred (here
IsWave) and the actual error message. All errors and warnings are listed in the reference section at
the end of this book indexed by error/warning number. This listing contains more details on each
error or warning along with suggested causes.
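When collecting tool output in log files, the numbering scheme above can be decoded mechanically. The following sketch is not part of HTK; the function name is hypothetical, and the header layout for warnings (the word WARNING followed by a negative number in square brackets) is assumed by analogy with the error example.

```python
import re

# Matches a header such as "HList: ERROR [+1110]".
_HEADER = re.compile(
    r"^\s*(?P<tool>\w+):\s+(?P<kind>ERROR|WARNING)\s+\[(?P<num>[+-]\d+)\]"
)


def decode_message_header(header):
    """Split an error/warning header into tool, sign, module and class codes."""
    match = _HEADER.match(header)
    if match is None:
        raise ValueError(f"not an HTK-style message header: {header!r}")
    number = int(match.group("num"))
    digits = f"{abs(number):04d}"
    return {
        "tool": match.group("tool"),
        "number": number,
        "is_error": number > 0,          # errors positive, warnings negative
        "module_code": int(digits[:2]),  # first two digits: module or tool
        "class_code": int(digits[2:]),   # last two digits: class of error
    }


info = decode_message_header("HList: ERROR [+1110]")
```

For the example message, the decoded module code is 11 and the class code is 10.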

Error messages are sent to the standard error stream but warnings are sent to the standard 
output stream. The reason for the latter is that most HTK tools are run with progress tracing 
enabled. Sending warnings to the standard output stream ensures that they are properly interleaved 
with the trace of progress so that it is easy to determine the point at which the warning was issued. 
Sending warnings to standard error would lose this information. 

The default behaviour of a HTK tool on terminating due to an error is to exit normally returning
the error number as exit status. If, however, the configuration variable ABORTONERR is set to true
then the tool will core dump. This is a debugging facility which should not concern most users.

4.6 Strings and Names

Many HTK definition files include names of various types of objects: for example labels, model 
names, words, etc. In order to achieve some uniformity, HTK applies standard rules for reading 
strings which are names. These rules are not, however, necessary when using the language modelling 
tools - see below. 

A name string consists of a single white space delimited word or a quoted string. Either the
single quote ' or the double quote " can be used to quote strings but the start and end quotes must
be matched. The backslash \ character can also be used to introduce otherwise reserved characters.
The character following a backslash is inserted into the string without special processing unless that
character is a digit in the range 0 to 7. In that case, the three characters following the backslash
are read and interpreted as an octal character code. When the three characters are not octal digits
the result is not well defined.

In summary the special processing is

    Notation     Meaning

    \\           \
    \<space>     represents a space that will not terminate a string
    \'           ' (and will not end a quoted string)
    \"           " (and will not end a quoted string)
    \nnn         the character with octal code \nnn



Note that the above allows the same effect to be achieved in a number of different ways. For
example,

    "\"QUOTE"
    \"QUOTE
    '"QUOTE'
    \042QUOTE

all produce the string "QUOTE.
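The general rules for name strings can be checked against these four examples with a small reader. This is a sketch of the rules as stated in this section, not HTK's own routine; the function name is hypothetical, and because the text leaves malformed octal escapes undefined, the sketch simply rejects them.

```python
OCTAL_DIGITS = "01234567"


def read_name(text):
    """Read one name string: a white space delimited word or a quoted string."""
    i, n = 0, len(text)
    while i < n and text[i].isspace():       # skip leading white space
        i += 1
    quote = text[i] if i < n and text[i] in "'\"" else None
    if quote:
        i += 1
    out = []
    while i < n:
        ch = text[i]
        if ch == "\\":
            if i + 1 >= n:
                raise ValueError("backslash at end of input")
            nxt = text[i + 1]
            if nxt in OCTAL_DIGITS:
                code = text[i + 1:i + 4]
                if len(code) != 3 or any(c not in OCTAL_DIGITS for c in code):
                    raise ValueError("malformed octal escape")
                out.append(chr(int(code, 8)))
                i += 4
            else:
                out.append(nxt)              # inserted without special processing
                i += 2
        elif quote and ch == quote:
            return "".join(out)              # matching end quote
        elif not quote and ch.isspace():
            break                            # unquoted word ends at white space
        else:
            out.append(ch)
            i += 1
    if quote:
        raise ValueError("unterminated quoted string")
    return "".join(out)


examples = [r'"\"QUOTE"', r'\"QUOTE', "'\"QUOTE'", r'\042QUOTE']
results = [read_name(e) for e in examples]
```

Each of the four spellings yields the same six-character string consisting of a double quote followed by QUOTE.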

The only exceptions to the above general rules are:

• Where models are specified in HHEd scripts, commas (,), dots (.), and closing brackets ())
are all used as extra delimiters to allow HHEd scripts created for earlier versions of HTK to
be used unchanged. Hence for example, (a,b,c,d) would be split into 4 distinct name strings
a, b, c and d.

• When the configuration variable RAWMITFORMAT is set true, each word in a language model
definition file consists of a white space delimited string with no special processing being
performed.

• Source dictionaries read by HDMan are read using the standard HTK string conventions,
however, the command IR can be used in a HDMan source edit script to switch to using this
raw format.

• To ensure that the general definition of a name string works properly in HTK master label files,
all MLFs must have the reserved . and /// terminators alone on a line with no surrounding
white space. If this causes problems reading old MLF files, the configuration variable V1COMPAT
should be set true in the module HLabel. In this case, HTK will attempt to simulate the
behaviour of the older version 1.5.

• To force numbers to be interpreted as strings rather than times or scores in a label file, they
must be quoted. If the configuration variable QUOTECHAR is set to ' or " then output labels
will be quoted with the specified quote character. If QUOTECHAR is set to \, then output labels
will be escaped. The default is to select the simplest quoting mechanism.

Note that under some versions of Unix HTK can support the 8-bit character sets used for the
representation of various orthographies. In such cases the shell environment variable $LANG usually
governs which ISO character set is in use.

Language modelling tools

Although these string conventions are unnecessary in HLM, to maintain compatibility with HTK the
same conventions are used. However, a number of options are provided to allow a mix of escaped
and unescaped text files to be handled. Word maps allow the type of escaping (HTK or none)
to be defined in their headers. When a degenerate form of word map is used (i.e. a map with no
header), the LWMap configuration variable INWMAPRAW may be set to true to disable HTK escaping.
By default, HLM tools output word lists and maps in HTK escaped form. However, this can be
overridden by setting the configuration variable OUTWMAPRAW to true. Similar conventions apply to
class maps. A degenerate class map can be read in raw mode by setting the LClass configuration
variable INCMAPRAW to true, and a class map can be written in raw form by setting OUTCMAPRAW to
true.

Input/output of N-gram language model files is handled by the HLM module LModel. Hence,
by default input/output of LMs stored in the ARPA-MIT text format will assume HTK escaping
conventions. This can be disabled for both input and output by setting RAWMITFORMAT to true.

4.7 Memory Management

Memory management is a very low-level function and is mostly invisible to HTK users. However,
some applications require very large amounts of memory. For example, building the models for
a large vocabulary continuous speech dictation system might require 150MB or more. Clearly,
when memory demands become this large, a proper understanding of the impact of system design
decisions on memory usage is important. The first step in this is to have a basic understanding of
memory allocation in HTK.

Many HTK tools dynamically construct large and complex data structures in memory. To keep
strict control over this and to reduce memory allocation overheads to an absolute minimum, HTK
performs its own memory management. Thus, every time that a module or tool wishes to allocate
some memory, it does so by calling routines in HMem. At a slightly higher level, math objects such
as vectors and matrices are allocated by HMath but using the primitives provided by HMem.

To make memory allocation and de-allocation very fast, tools create specific memory allocators
for specific objects or groups of objects. These memory allocators are divided into a sequence of
blocks, and they are organised as either Stacks, M-heaps or C-heaps. A Stack constrains the pattern
of allocation and de-allocation requests to be made in a last-allocated first-deallocated order but
allows objects of any size to be allocated. An M-heap allows an arbitrary pattern of allocation
and de-allocation requests to be made but all allocated objects must be the same size. Both of
these memory allocation disciplines are more restricted than the general mechanism supplied by
the operating system, and as a result, such memory operations are faster and incur no storage
overhead due to the need to maintain hidden housekeeping information in each allocated object.
Finally, a C-heap uses the underlying operating system and allows arbitrary allocation patterns,
and as a result incurs the associated time and space overheads. The use of C-heaps is avoided
wherever possible.

Most tools provide one or more trace options which show how much memory has been allocated.
The following shows the form of the output

    ---------------------- Heap Statistics ------------------------
    nblk=1, siz= 100000*1 , used=  32056, alloc= 100000 : Global Stack[S]
    nblk=1, siz=    200*28, used=    100, alloc=   5600 : cellHeap[M]
    nblk=1, siz=  10000*1 , used=   3450, alloc=  10000 : mlfHeap[S]
    nblk=2, siz=   7504*1 , used=   9216, alloc=  10346 : nameHeap[S]
    ---------------------------------------------------------------

Each line describes the status of each memory allocator and gives the number of blocks allocated,
the current block size (number of elements in block × the number of bytes in each element),
the total number of bytes in use by the tool and the total number of bytes currently allocated to
that allocator. The end of each line gives the name of the allocator and its type: Stack[S],
M-heap[M] or C-heap[M]. The element size for Stacks will always be 1 but will be variable in M-heaps.
The documentation for the memory intensive HTK tools indicates what each of the main memory
allocators are used for and this information allows the effects of various system design choices to
be monitored.



Note: block sizes typically grow as more blocks are allocated.

4.8 Input/Output via Pipes and Networks

Most types of file in HTK can be input or output via a pipe instead of directly from or to disk. The
mechanism for doing this is to assign the required input or output filter command to a configuration
parameter or to an environment variable (either can be used). Within this command, any occurrence
of the dollar symbol $ will be replaced by the name of the required file. The output of the command
will then be input to or output from the HTK tool via a pipe.

For example, the following command will normally list the contents of the speech waveform file
spfile

    HList spfile

However, if the value of the environment variable HWAVEFILTER is set as follows

    setenv HWAVEFILTER 'gunzip -c $'

then the effect is to invoke the decompression filter gunzip with its input connected to the file
spfile and its output connected to HList via a pipe. Each different type of file has a unique
associated variable so that multiple input and/or output filters can be used. The full list of these is given
in the summary section at the end of this chapter.
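The substitution step can be sketched in a few lines. This is an illustration of the idea rather than HTK's implementation; the function names are hypothetical, and quoting the file name for the shell is an extra precaution added in the sketch, not something the text above describes.

```python
import os
import shlex
import subprocess


def filter_command(template, filename):
    """Replace every $ in the filter template with the (shell-quoted) file name."""
    return template.replace("$", shlex.quote(filename))


def open_input(filename, env_var="HWAVEFILTER"):
    """Open a file directly, or through the filter named by env_var if it is set."""
    template = os.environ.get(env_var)
    if template is None:
        return open(filename, "rb")
    # The filter's standard output is what the reader consumes via a pipe.
    process = subprocess.Popen(
        filter_command(template, filename), shell=True, stdout=subprocess.PIPE
    )
    return process.stdout


command = filter_command("gunzip -c $", "spfile")
```

With the template from the example, the command that would be run for spfile is gunzip -c spfile.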

HTK is often used to process large amounts of data and typically this data is distributed across
a network. In many systems, an attempt to open a file can fail because of temporary network
glitches. In the majority of cases, a second or third attempt to open the file a few seconds later will
succeed and all will be well. To allow this to be done automatically, HTK tools can be configured to
retry opening a file several times before giving up. This is done simply by setting the configuration
parameter MAXTRYOPEN to the required number of retries.
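The retry behaviour is easy to picture as a loop around the open call. The sketch below is an analogy, not HTK code: the function name, the delay between attempts and the interpretation of the limit as a total number of attempts are all assumptions made for illustration.

```python
import time


def open_with_retries(path, mode="rb", max_tries=3, delay_seconds=2.0,
                      opener=open, sleep=time.sleep):
    """Try to open path up to max_tries times, pausing between failures."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for attempt in range(max_tries):
        try:
            return opener(path, mode)
        except OSError as exc:
            last_error = exc
            if attempt + 1 < max_tries:
                sleep(delay_seconds)     # wait before the next attempt
    raise last_error
```

The opener and sleep parameters exist only so that the loop can be exercised without a real network file system.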

4.9 Byte-swapping of HTK data files

Virtually all HTK tools can read and write data to and from binary files. The use of binary
format as opposed to text can speed up the performance of the tools and at the same time reduce
the file size when manipulating large quantities of data. Typical binary files used by the HTK
tools are speech waveform/parameter files, binary master model files (MMF), binary accumulator
files used in HMM parameter estimation and binary lattice files. However, the use of binary
data format often introduces incompatibilities between different machine architectures due to the
different byte ordering conventions used to represent numerical quantities. In such cases, byte
swapping of the data is required. To avoid incompatibilities across different machine architectures,
all HTK binary data files are written out using big-endian (NONVAX) representation of numerical
values. Similarly, during loading HTK binary format files are assumed to be in NONVAX byte order.
The default behaviour can be altered using the configuration parameters NATURALREADORDER and
NATURALWRITEORDER. Setting NATURALREADORDER to true will instruct the HTK tools to interpret
the binary input data in the machine's natural byte order (byte swapping will never take place).
Similarly, setting NATURALWRITEORDER to true will instruct the tools to write out data using the
machine's natural byte order. The default value of these two configuration variables is false which
is the appropriate setting when using HTK in a multiple machine architecture environment. In an
environment consisting entirely of machines with VAX byte order both configuration parameters
can be set true which will disable the byte swapping procedure during reading/writing of data.
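The difference between the default big-endian order and the machine's natural order can be seen with Python's struct module. This sketch only illustrates byte ordering of 32-bit floats; the function names and the two flag arguments are hypothetical stand-ins for the effect of the two configuration parameters, and no HTK file header is involved.

```python
import struct


def pack_floats(values, natural_write_order=False):
    """Pack 32-bit floats big-endian by default, or in native byte order."""
    prefix = "=" if natural_write_order else ">"
    return struct.pack(f"{prefix}{len(values)}f", *values)


def unpack_floats(data, natural_read_order=False):
    """Unpack 32-bit floats, assuming big-endian data unless told otherwise."""
    prefix = "=" if natural_read_order else ">"
    return list(struct.unpack(f"{prefix}{len(data) // 4}f", data))


big_endian = pack_floats([1.0, -2.5])
round_trip = unpack_floats(big_endian)
```

On a little-endian machine, reading big_endian with natural_read_order=True would give wrong values, which is why the default of false suits mixed-architecture environments.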

4.10 Summary 

This section summarises the globally-used environment variables and configuration parameters. It 
also provides a list of all the standard command line options used with HTK. 

Table 4.1 lists all of the configuration parameters along with a brief description. A missing 
module name means that it is recognised by more than one module. Table 4.2 lists all of the 
environment parameters used by these modules. Finally, table 4.3 lists all of the standard options. 



Note: retrying file opens via MAXTRYOPEN does not work if input filters are used.

    Module   Name               Description

    HShell   ABORTONERR         Core dump on error (for debugging)
    HShell   HWAVEFILTER        Filter for waveform file input
    HShell   HPARMFILTER        Filter for parameter file input
    HShell   HLANGMODFILTER     Filter for language model file input
    HShell   HMMLISTFILTER      Filter for HMM list file input
    HShell   HMMDEFFILTER       Filter for HMM definition file input
    HShell   HLABELFILTER       Filter for Label file input
    HShell   HNETFILTER         Filter for Network file input
    HShell   HDICTFILTER        Filter for Dictionary file input
    HShell   LGRAMFILTER        Filter for gram file input
    HShell   LWMAPFILTER        Filter for word map file input
    HShell   LCMAPFILTER        Filter for class map file input
    HShell   LMTEXTFILTER       Filter for text file input
    HShell   HWAVEOFILTER       Filter for waveform file output
    HShell   HPARMOFILTER       Filter for parameter file output
    HShell   HLANGMODOFILTER    Filter for language model file output
    HShell   HMMLISTOFILTER     Filter for HMM list file output
    HShell   HMMDEFOFILTER      Filter for HMM definition file output
    HShell   HLABELOFILTER      Filter for Label file output
    HShell   HNETOFILTER        Filter for Network file output
    HShell   HDICTOFILTER       Filter for Dictionary file output
    HShell   LGRAMOFILTER       Filter for gram file output
    HShell   LWMAPOFILTER       Filter for word map file output
    HShell   LCMAPOFILTER       Filter for class map file output
    HShell   MAXTRYOPEN         Number of file open retries
    HShell   NONUMESCAPES       Prevent string output using \012 format
    HShell   NATURALREADORDER   Enable natural read order for HTK binary files
    HShell   NATURALWRITEORDER  Enable natural write order for HTK binary files
    HMem     PROTECTSTAKS       Warn if stack is cut-back (debugging)
             TRACE              Trace control (default=0)
             STARTWORD          Set sentence start symbol (<s>)
             ENDWORD            Set sentence end symbol (</s>)
             UNKNOWNNAME        Set OOV class symbol (!!UNK)
             RAWMITFORMAT       Disable HTK escaping for LM tools
    LWMap    INWMAPRAW          Disable HTK escaping for input word lists and maps
    LWMap    OUTWMAPRAW         Disable HTK escaping for output word lists and maps
    LCMap    INCMAPRAW          Disable HTK escaping for input class lists and maps
    LCMap    OUTCMAPRAW         Disable HTK escaping for output class lists and maps

Table 4.1 Configuration Parameters used in Operating Environment

    Env Variable   Meaning

    HCONFIG        Name of default configuration file
    HxxxFILTER     Input/Output filters as above

Table 4.2 Environment Variables used in Operating Environment



    Standard Option   Meaning

    -A                Print command line arguments
    -B                Store output HMM macro files in binary
    -C cf             Configuration file is cf
    -D                Display configuration variables
    -E dir [ext]      Search for parent transform macros in directory dir
    -F fmt            Set source data file format to fmt
    -G fmt            Set source label file format to fmt
    -H mmf            Load HMM macro file mmf
    -I mlf            Load master label file mlf
    -J dir [ext]      Search for transform macros in directory dir
    -K dir [ext]      Save transform models in directory dir
    -L dir            Look for label files in directory dir
    -M dir            Store output HMM macro files in directory dir
    -O fmt            Set output data file format to fmt
    -P fmt            Set output label file format to fmt
    -Q                Print command summary info
    -S scp            Use command line script file scp
    -T N              Set trace level to N
    -V                Print version information
    -X ext            Set label file extension to ext

Table 4.3 Summary of Standard Options

Chapter 5

Speech Input/Output



Many tools need to input parameterised speech data and HTK provides a number of different
methods for doing this:

• input from a previously encoded speech parameter file

• input from a waveform file which is encoded as part of the input processing

• input from an audio device which is encoded as part of the input processing.

For input from a waveform file, a large number of different file formats are supported, including
all of the commonly used CD-ROM formats. Input/output for parameter files is limited to the
standard HTK file format and the new Entropic Esignal format.

[Figure: the HTK Tool and its library modules HAudio, HWave, HParm, HVQ, HLabel, HLM, HNet,
HDict, HModel, HUtil, HSigP, HShell, HMem, HGraf, HMath, HTrain, HFB, HAdapt and HRec,
with lattices/constraint networks, dictionary and language models as external data]

All HTK speech input is controlled by configuration parameters which give details of what
processing operations to apply to each input speech file or audio source. This chapter describes
speech input/output in HTK. The general mechanisms are explained and the various configuration
parameters are defined. The facilities for signal pre-processing, linear prediction-based processing,
Fourier-based processing and vector quantisation are presented and the supported file formats are
given. Also described are the facilities for augmenting the basic speech parameters with energy
measures, delta coefficients and acceleration (delta-delta) coefficients and for splitting each parameter
vector into multiple data streams to form observations. The chapter concludes with a brief
description of the tools HList and HCopy which are provided for viewing, manipulating and encoding
speech files.



5.1 General Mechanism 

The facilities for speech input and output in HTK are provided by five distinct modules: HAudio, 
HWave, HParm, HVQ and HSigP. The interconnections between these modules are shown in 
Fig. 5.1. 



[Figure: Fig. 5.1 Speech Input Subsystem. Inputs: Parameter File, Waveform File, Audio Input.
Modules: HWave, HAudio, HSigP, HParm, HVQ. Output: Observations (Parameter Vectors and/or
VQ Symbols)]

Waveforms are read from files using HWave, or are input direct from an audio device using
HAudio. In a few rare cases, such as in the display tool HSLab, only the speech waveform is
needed. However, in most cases the waveform is wanted in parameterised form and the required
encoding is performed by HParm using the signal processing operations defined in HSigP. The
parameter vectors are output by HParm in the form of observations which are the basic units of
data processed by the HTK recognition and training tools. An observation contains all components
of a raw parameter vector but it may be split into a number of independent parts. Each
such part is regarded by a HTK tool as a statistically independent data stream. Also, an observation
may include VQ indices attached to each data stream. Alternatively, VQ indices can be read directly
from a parameter file in which case the observation will contain only VQ indices.

Usually a HTK tool will require a number of speech data files to be specified on the command
line. In the majority of cases, these files will be required in parameterised form. Thus, the following
example invokes the HTK embedded training tool HERest to re-estimate a set of models using
the speech data files s1, s2, s3, .... These are input via the library module HParm and they must
be in exactly the form needed by the models.

    HERest ... s1 s2 s3 s4 ...

However, if the external form of the speech data files is not in the required form, it will often
be possible to convert them automatically during the input process. To do this, configuration
parameter values are specified whose function is to define exactly how the conversion should be
done. The key idea is that there is a source parameter kind and target parameter kind. The source
refers to the natural form of the data in the external medium and the target refers to the form of
the data that is required internally by the HTK tool. The principal function of the speech input
subsystem is to convert the source parameter kind into the required target parameter kind.

Parameter kinds consist of a base form to which one or more qualifiers may be attached where
each qualifier consists of a single letter preceded by an underscore character. Some examples of
parameter kinds are

    WAVEFORM    simple waveform
    LPC         linear prediction coefficients
    LPC_D_E     LPC with energy and delta coefficients
    MFCC_C      compressed mel-cepstral coefficients
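The structure of a parameter kind name (a base form followed by underscore-separated single-character qualifiers) can be shown with a short splitting sketch. The function name is hypothetical and the sketch checks only the shape of the name; it does not know which base forms or qualifiers HTK actually accepts.

```python
def split_parm_kind(kind):
    """Split a parameter kind such as LPC_D_E into its base form and qualifiers."""
    base, *qualifiers = kind.strip().upper().split("_")
    if not base:
        raise ValueError(f"missing base form in {kind!r}")
    for qualifier in qualifiers:
        if len(qualifier) != 1:
            raise ValueError(f"qualifier {qualifier!r} is not a single character")
    return base, qualifiers


base, qualifiers = split_parm_kind("LPC_D_E")
```

For LPC_D_E this gives the base form LPC with the qualifiers D and E, while WAVEFORM has a base form and no qualifiers.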

The required source and target parameter kinds are specified using the configuration parameters
SOURCEKIND and TARGETKIND. Thus, if the following configuration parameters were defined

    SOURCEKIND = WAVEFORM
    TARGETKIND = MFCC_E

then the speech input subsystem would expect each input file to contain a speech waveform and it
would convert it to mel-frequency cepstral coefficients with log energy appended.

The source need not be a waveform. For example, the configuration parameters

    SOURCEKIND = LPC
    TARGETKIND = LPREFC

would be used to read in files containing linear prediction coefficients and convert them to reflection
coefficients.

For convenience, a special parameter kind called ANON is provided. When the source is specified
as ANON then the actual kind of the source is determined from the input file. When ANON is used
in the target kind, then it is assumed to be identical to the source. For example, the effect of the
following configuration parameters

    SOURCEKIND = ANON
    TARGETKIND = ANON_D

would simply be to add delta coefficients to whatever the source form happened to be. The source
and target parameter kinds default to ANON to indicate that by default no input conversions are
performed. Note, however, that where two or more files are listed on the command line, the meaning
of ANON will not be re-interpreted from one file to the next. Thus, it is a general rule that any tool
reading multiple source speech files requires that all the files have the same parameter kind.

The conversions applied by HTK's input subsystem can be complex and may not always behave
exactly as expected. There are two facilities that can be used to help check and debug the set-up
of the speech i/o configuration parameters. Firstly, the tool HList simply displays speech data
by listing it on the terminal. However, since HList uses the speech input subsystem like all HTK
tools, if a value for TARGETKIND is set, then it will display the target form rather than the source
form. This is the simplest way to check the form of the speech data that will actually be delivered
to a HTK tool. HList is described in more detail in section 5.15 below.

Secondly, trace output can be generated from the HParm module by setting the TRACE configuration
file parameter. This is a bit-string in which individual bits cover different parts of the
conversion processing. The details are given in the reference section.

To summarise, speech input in HTK is controlled by configuration parameters. The key parameters
are SOURCEKIND and TARGETKIND which specify the source and target parameter kinds.
These determine the end-points of the required input conversion. However, to properly specify the
detailed steps in between, more configuration parameters must be defined. These are described in
subsequent sections.

5.2 Speech Signal Processing

In this section, the basic mechanisms involved in transforming a speech waveform into a sequence of
parameter vectors will be described. Throughout this section, it is assumed that the SOURCEKIND is
WAVEFORM and that data is being read from a HTK format file via HWave. Reading from different
format files is described below in section 5.11. Much of the material in this section also applies to
data read direct from an audio device; the additional features needed to deal with this latter case
are described later in section 5.12.

The overall process is illustrated in Fig. 5.2 which shows the sampled waveform being converted
into a sequence of parameter blocks. In general, HTK regards both waveform files and parameter
files as being just sample sequences, the only difference being that in the former case the samples
are 2-byte integers and in the latter they are multi-component vectors. The sample rate of the
input waveform will normally be determined from the input file itself. However, it can be set
explicitly using the configuration parameter SOURCERATE. The period between each parameter vector
determines the output sample rate and it is set using the configuration parameter TARGETRATE. The
segment of waveform used to determine each parameter vector is usually referred to as a window
and its size is set by the configuration parameter WINDOWSIZE. Notice that the window size and
frame rate are independent. Normally, the window size will be larger than the frame period so that
successive windows overlap as illustrated in Fig. 5.2.

For example, a waveform sampled at 16kHz would be converted into 100 parameter vectors per
second using a 25 msec window by setting the following configuration parameters.

SOURCERATE = 625
TARGETRATE = 100000
WINDOWSIZE = 250000

Remember that all durations are specified in 100 nsec units^1.
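Since every period is expressed in 100 nsec units, the three values in this example follow from simple arithmetic. The short Python sketch below is not part of HTK and its helper names are purely illustrative; it reproduces the values and derives the number of samples per window and per frame shift.

```python
# HTK expresses every duration in units of 100 ns, i.e. 1e-7 s.
HTK_UNITS_PER_SECOND = 10_000_000

def hz_to_htk_period(rate_hz):
    """Period of a stream running at rate_hz, in 100 ns units."""
    return HTK_UNITS_PER_SECOND / rate_hz

def ms_to_htk(duration_ms):
    """A duration given in milliseconds, in 100 ns units."""
    return duration_ms * 10_000

source_rate = hz_to_htk_period(16_000)  # 16 kHz sampling
target_rate = hz_to_htk_period(100)     # 100 parameter vectors per second
window_size = ms_to_htk(25)             # 25 msec analysis window

# Samples covered by one window and by one frame shift.
samples_per_window = round(window_size / source_rate)
samples_per_shift = round(target_rate / source_rate)

print(source_rate, target_rate, window_size)
print(samples_per_window, samples_per_shift)
```

With these settings each window spans 400 samples and successive windows start 160 samples apart, so neighbouring windows overlap by 240 samples.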



Fig. 5.2 Speech Encoding Process
(figure labels: Window Duration (WINDOWSIZE), SOURCERATE, Frame Period (TARGETRATE), blocks n and n+1,
Parameter Vector Size)

Independent of what parameter kind is required, there are some simple pre-processing operations
that can be applied prior to performing the actual signal analysis. Firstly, the DC mean can be
removed from the source waveform by setting the Boolean configuration parameter ZMEANSOURCE
to true (i.e. T). This is useful when the original analogue-digital conversion has added a DC offset
to the signal. It is applied to each window individually so that it can be used both when reading
from a file and when using direct audio input^2.

Secondly, it is common practice to pre-emphasise the signal by applying the first order difference
equation

$$ s'_n = s_n - k\, s_{n-1} \qquad (5.1) $$

to the samples $\{s_n, n = 1, N\}$ in each window. Here $k$ is the pre-emphasis coefficient which should
be in the range $0 \le k < 1$. It is specified using the configuration parameter PREEMCOEF. Finally,
it is usually beneficial to taper the samples in each window so that discontinuities at the window
edges are attenuated. This is done by setting the Boolean configuration parameter USEHAMMING to
true. This applies the following transformation to the samples $\{s_n, n = 1, N\}$ in the window

$$ s'_n = \left\{ 0.54 - 0.46 \cos\left( \frac{2\pi (n-1)}{N-1} \right) \right\} s_n \qquad (5.2) $$

When both pre-emphasis and Hamming windowing are enabled, pre-emphasis is performed first.

In practice, all three of the above are usually applied. Hence, a configuration file will typically
contain the following

ZMEANSOURCE = T
USEHAMMING = T
PREEMCOEF = 0.97
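As a rough illustration of these three operations, the following Python sketch applies them to a single window in the order zero mean, pre-emphasis, Hamming window. It is not HTK's code: the function name is invented here, and the treatment of the first sample, which has no predecessor inside the window, is an assumption of this sketch (it is simply scaled by 1 - k).

```python
import math

def preprocess_window(samples, k=0.97, zero_mean=True, hamming=True):
    """Return a pre-processed copy of one analysis window (length >= 2)."""
    s = [float(x) for x in samples]
    n_samples = len(s)
    if zero_mean:
        mean = sum(s) / n_samples
        s = [x - mean for x in s]
    # Pre-emphasis s'_n = s_n - k * s_{n-1}. Working backwards means that
    # s[n - 1] still holds its un-emphasised value when s[n] is updated.
    for n in range(n_samples - 1, 0, -1):
        s[n] -= k * s[n - 1]
    s[0] *= 1.0 - k  # ASSUMPTION: no earlier sample exists for the first one
    if hamming:
        # Hamming taper; n is zero-based here, so (n - 1) in the text becomes n.
        for n in range(n_samples):
            s[n] *= 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
    return s

print(preprocess_window([1, 2, 3], k=0.5, zero_mean=False, hamming=False))
```

Setting k to 0 disables pre-emphasis in this sketch, which makes it easy to inspect the effect of each stage separately.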



^1 The somewhat bizarre choice of 100 nsec units originated in Version 1 of HTK when times were represented by
integers and this unit was the best compromise between precision and range. Times are now represented by doubles
and hence the constraints no longer apply. However, the need for backwards compatibility means that 100 nsec units
have been retained. The names SOURCERATE and TARGETRATE are also non-ideal; SOURCEPERIOD and TARGETPERIOD
would be better.

^2 This method of applying a zero mean is different to HTK Version 1.5 where the mean was calculated and
subtracted from the whole speech file in one operation. The configuration variable V1COMPAT can be set to revert to
this older behaviour.

Certain types of artificially generated waveform data can cause numerical overflows with some
coding schemes. In such cases adding a small amount of random noise to the waveform data solves
the problem. The noise is added to the samples using

$$ s'_n = s_n + q\,\mathrm{RND}() \qquad (5.3) $$

where RND() is a uniformly distributed random value over the interval [-1.0, +1.0) and q is the
scaling factor. The amount of noise added to the data (q) is set with the configuration parameter
ADDDITHER (default value 0.0). A positive value causes the noise signal added to be the same every
time (ensuring that the same file always gives exactly the same results). With a negative value the
noise is random and the same file may produce slightly different results in different trials.
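The sign convention of ADDDITHER can be mimicked in a few lines. The Python sketch below is only an illustration under stated assumptions: the function name and the fixed seed are invented here, and HTK's own random number generator will produce different noise values.

```python
import random

def add_dither(samples, amount=0.0, seed=0):
    """Add uniform noise scaled by |amount| to every sample."""
    q = abs(amount)
    # Positive setting: fixed seed, hence repeatable noise.
    # Negative setting: a freshly seeded generator on every call.
    rng = random.Random(seed) if amount > 0 else random.Random()
    # 2 * random() - 1 is uniformly distributed over [-1.0, +1.0).
    return [s + q * (2.0 * rng.random() - 1.0) for s in samples]

silence = [0.0] * 4
print(add_dither(silence, 1.0) == add_dither(silence, 1.0))  # repeatable
```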

One problem that can arise when processing speech waveform files obtained from external
sources, such as databases on CD-ROM, is that the byte-order may be different to that used
by the machine on which HTK is running. To deal with this problem, HWave can perform automatic
byte-swapping in order to preserve proper byte order. HTK assumes by default that speech
waveform data is encoded as a sequence of 2-byte integers as is the case for most current speech
databases^3. If the source format is known, then HWave will also make an assumption about the
byte order used to create speech files in that format. It then checks the byte order of the machine
that it is running on and automatically performs byte-swapping if the order is different. For unknown
formats, proper byte order can be ensured by setting the configuration parameter BYTEORDER
to VAX if the speech data was created on a little-endian machine such as a VAX or an IBM PC, and
to anything else (e.g. NONVAX) if the speech data was created on a big-endian machine such as a
SUN, HP or Macintosh machine.

The reading/writing of HTK format waveform files can be further controlled via the configuration
parameters NATURALREADORDER and NATURALWRITEORDER. The effect and default settings
of these parameters are described in section 4.9. Note that BYTEORDER should not be used when
NATURALREADORDER is set to true. Finally, note that HTK can also byte-swap parameterised files in
a similar way provided that only the byte-order of each 4 byte float requires inversion.

5.3 Linear Prediction Analysis

In linear prediction (LP) analysis, the vocal tract transfer function is modelled by an all-pole filter
with transfer function^4

$$ H(z) = \frac{1}{\sum_{i=0}^{p} a_i z^{-i}} \qquad (5.4) $$

where $p$ is the number of poles and $a_0 \equiv 1$. The filter coefficients $\{a_i\}$ are chosen to minimise the
mean square filter prediction error summed over the analysis window. The HTK module HSigP
uses the autocorrelation method to perform this optimisation as follows.

Given a window of speech samples $\{s_n, n = 1, N\}$, the first $p + 1$ terms of the autocorrelation
sequence are calculated from

$$ r_i = \sum_{j=1}^{N-i} s_j s_{j+i} \qquad (5.5) $$

where $i = 0, p$. The filter coefficients are then computed recursively using a set of auxiliary
coefficients $\{k_i\}$ which can be interpreted as the reflection coefficients of an equivalent acoustic tube and
the prediction error $E$ which is initially equal to $r_0$. Let $\{k_j^{(i-1)}\}$ and $\{a_j^{(i-1)}\}$ be the reflection and
filter coefficients for a filter of order $i - 1$, then a filter of order $i$ can be calculated in three steps.
Firstly, a new set of reflection coefficients are calculated.

$$ k_j^{(i)} = k_j^{(i-1)} \qquad (5.6) $$

for $j = 1, i - 1$ and

$$ k_i^{(i)} = \left\{ r_i + \sum_{j=1}^{i-1} a_j^{(i-1)} r_{i-j} \right\} / E^{(i-1)} \qquad (5.7) $$

Secondly, the prediction energy is updated.

$$ E^{(i)} = (1 - k_i^{(i)} k_i^{(i)})\, E^{(i-1)} \qquad (5.8) $$

Finally, new filter coefficients are computed

$$ a_j^{(i)} = a_j^{(i-1)} - k_i^{(i)} a_{i-j}^{(i-1)} \qquad (5.9) $$

for $j = 1, i - 1$ and

$$ a_i^{(i)} = -k_i^{(i)} \qquad (5.10) $$

This process is repeated from $i = 1$ through to the required filter order $i = p$.

^3 Many of the more recent speech databases use compression. In these cases, the data may be regarded as being
logically encoded as a sequence of 2-byte integers even if the actual storage uses a variable length encoding scheme.

^4 Note that some textbooks define the denominator of equation 5.4 as $1 - \sum_{i=1}^{p} a_i z^{-i}$ so that the filter coefficients
are the negatives of those computed by HTK.
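The recursion is compact enough to write out in full. The Python sketch below follows the three steps as described, using the sign convention in which $a_0 = 1$ and $a_i^{(i)} = -k_i^{(i)}$; the function names are illustrative and this is not the HSigP source. It assumes a non-silent window, so that $r_0 > 0$.

```python
def autocorrelation(s, p):
    """r_i = sum over j of s_j * s_{j+i}, for i = 0..p."""
    n = len(s)
    return [sum(s[j] * s[j + i] for j in range(n - i)) for i in range(p + 1)]

def levinson_durbin(r, p):
    """Return (a, k, e): filter coefficients a[0..p] with a[0] = 1,
    reflection coefficients k[1..p] (k[0] is unused) and the final
    prediction error e."""
    a = [1.0] + [0.0] * p
    k = [0.0] * (p + 1)
    e = r[0]  # initial prediction error
    for i in range(1, p + 1):
        # New reflection coefficient for order i.
        k[i] = (r[i] + sum(a[j] * r[i - j] for j in range(1, i))) / e
        # Update the prediction error.
        e *= 1.0 - k[i] * k[i]
        # Update the filter coefficients from the order i - 1 values.
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] - k[i] * prev[i - j]
        a[i] = -k[i]
    return a, k, e

a, k, e = levinson_durbin([1.0, 0.5, 0.25], 2)
print(a, k, e)
```

The autocorrelation sequence 1, 0.5, 0.25 used in the example is that of a first order process, so the second reflection coefficient comes out as zero and the order 2 filter reduces to the order 1 filter.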

To effect the above transformation, the target parameter kind must be set to either LPC to obtain
the LP filter parameters $\{a_i\}$ or LPREFC to obtain the reflection coefficients $\{k_i\}$. The required filter
order must also be set using the configuration parameter LPCORDER. Thus, for example, the following
configuration settings would produce a target parameterisation consisting of 12 reflection coefficients
per vector.

TARGETKIND = LPREFC
LPCORDER = 12

An alternative LPC-based parameterisation is obtained by setting the target kind to LPCEPSTRA
to generate linear prediction cepstra. The cepstrum of a signal is computed by taking a Fourier
(or similar) transform of the log spectrum. In the case of linear prediction cepstra, the required
spectrum is the linear prediction spectrum which can be obtained from the Fourier transform of
the filter coefficients. However, it can be shown that the required cepstra can be more efficiently
computed using a simple recursion

$$ c_n = -a_n - \frac{1}{n} \sum_{i=1}^{n-1} (n - i)\, a_i\, c_{n-i} \qquad (5.11) $$

The number of cepstra generated need not be the same as the number of filter coefficients, hence it
is set by a separate configuration parameter called NUMCEPS.
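A minimal Python sketch of the cepstral recursion is given below. It assumes the convention $a_0 = 1$ with filter coefficients $a_1, \ldots, a_p$, and it treats $a_i$ as zero for $i > p$ so that more cepstra than filter coefficients can be requested; the function name is illustrative rather than HTK's.

```python
def lpc_to_cepstra(a, num_ceps):
    """a = [1, a_1, ..., a_p]; returns [c_1, ..., c_num_ceps]."""
    p = len(a) - 1
    c = [0.0] * (num_ceps + 1)  # c[0] is unused
    for n in range(1, num_ceps + 1):
        a_n = a[n] if n <= p else 0.0
        # Terms with i > p vanish because a_i is zero there.
        acc = sum((n - i) * a[i] * c[n - i] for i in range(1, min(n - 1, p) + 1))
        c[n] = -a_n - acc / n
    return c[1:]

# Single pole filter 1 / (1 - 0.5 z^-1): the cepstra are 0.5^n / n.
print(lpc_to_cepstra([1.0, -0.5], 3))
```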

The principal advantage of cepstral coefficients is that they are generally decorrelated and this
allows diagonal covariances to be used in the HMMs. However, one minor problem with them is
that the higher order cepstra are numerically quite small and this results in a very wide range of
variances when going from the low to high cepstral coefficients. HTK does not have a problem
with this but for pragmatic reasons such as displaying model parameters, flooring variances, etc.,
it is convenient to re-scale the cepstral coefficients to have similar magnitudes. This is done by
setting the configuration parameter CEPLIFTER to some value $L$ to lifter the cepstra according to
the following formula

$$ c'_n = \left( 1 + \frac{L}{2} \sin \frac{\pi n}{L} \right) c_n \qquad (5.12) $$

As an example, the following configuration parameters would use a 14th order linear prediction
analysis to generate 12 liftered LP cepstra per target vector

TARGETKIND = LPCEPSTRA
LPCORDER = 14
NUMCEPS = 12
CEPLIFTER = 22

These are typical of the values needed to generate a good front-end parameterisation for a speech 
recogniser based on linear prediction. 
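The effect of liftering is easiest to see from the weights themselves. The Python sketch below applies the weighting $1 + (L/2)\sin(\pi n / L)$ to a vector of cepstra; the function name is illustrative and the input of all ones is chosen only so that the output equals the weights.

```python
import math

def lifter(cepstra, lifter_coef=22):
    """Re-scale cepstra c_1..c_N by 1 + (L / 2) * sin(pi * n / L)."""
    return [(1.0 + 0.5 * lifter_coef * math.sin(math.pi * n / lifter_coef)) * c
            for n, c in enumerate(cepstra, start=1)]

weights = lifter([1.0] * 12, 22)
print([round(w, 2) for w in weights])
```

With L = 22 the weight grows with n and reaches its maximum of 12 at n = 11, which boosts the numerically small higher order cepstra.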

Finally, note that the conversions supported by HTK are not limited to the case where the source 
is a waveform. HTK can convert any LP-based parameter into any other LP-based parameter. 



5.4 Filterbank Analysis 

The human ear resolves frequencies non-linearly across the audio spectrum and empirical evidence
suggests that designing a front-end to operate in a similar non-linear manner improves recognition
performance. A popular alternative to linear prediction based analysis is therefore filterbank
analysis since this provides a much more straightforward route to obtaining the desired non-linear
frequency resolution. However, filterbank amplitudes are highly correlated and hence, the use of
a cepstral transformation in this case is virtually mandatory if the data is to be used in a HMM
based recogniser with diagonal covariances.

HTK provides a simple Fourier transform based filterbank designed to give approximately equal
resolution on a mel-scale. Fig. 5.3 illustrates the general form of this filterbank. As can be seen,
the filters used are triangular and they are equally spaced along the mel-scale which is defined by

$$ \mathrm{Mel}(f) = 2595 \log_{10}\left( 1 + \frac{f}{700} \right) \qquad (5.13) $$

To implement this filterbank, the window of speech data is transformed using a Fourier transform
and the magnitude is taken. The magnitude coefficients are then binned by correlating them with
each triangular filter. Here binning means that each FFT magnitude coefficient is multiplied by
the corresponding filter gain and the results accumulated. Thus, each bin holds a weighted sum
representing the spectral magnitude in that filterbank channel. As an alternative, the Boolean
configuration parameter USEPOWER can be set true to use the power rather than the magnitude of
the Fourier transform in the binning process.
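The binning step can be sketched as follows. This Python fragment is one plausible reading of the description, not HTK's implementation: the function names, the way bin frequencies are derived from the number of magnitude values, and the exact handling of the triangle edges are all assumptions made for illustration.

```python
import math

def mel(f_hz):
    """Mel value of a frequency in Hz: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_filterbank(magnitudes, sample_rate_hz, num_chans, lo_hz=0.0, hi_hz=None):
    """Bin FFT magnitudes (bins 0 .. N/2) into num_chans triangular channels."""
    nyquist = sample_rate_hz / 2.0
    if hi_hz is None:
        hi_hz = nyquist
    m_lo, m_hi = mel(lo_hz), mel(hi_hz)
    step = (m_hi - m_lo) / (num_chans + 1)
    # Channel c rises from edge c to its centre at edge c + 1 and falls
    # back to zero at edge c + 2, all measured on the mel axis.
    edges = [m_lo + i * step for i in range(num_chans + 2)]
    bins = [0.0] * num_chans
    hz_per_bin = nyquist / (len(magnitudes) - 1)
    for b, mag in enumerate(magnitudes):
        m = mel(b * hz_per_bin)
        for c in range(num_chans):
            left, centre, right = edges[c], edges[c + 1], edges[c + 2]
            if left < m <= centre:
                bins[c] += mag * (m - left) / (centre - left)
            elif centre < m < right:
                bins[c] += mag * (right - m) / (right - centre)
    return bins

channels = mel_filterbank([1.0] * 257, 16000, 24)
print(len(channels), round(mel(1000.0), 1))
```

Because the channel edges are equally spaced in mel rather than in Hz, the higher channels cover more FFT bins than the lower ones.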




Fig. 5.3 Mel-Scale Filter Bank
(figure labels: freq; Energy in Each Band)



Normally the triangular filters are spread over the whole frequency range from zero up to the
Nyquist frequency. However, band-limiting is often useful to reject unwanted frequencies or avoid
allocating filters to frequency regions in which there is no useful signal energy. For filterbank analysis
only, lower and upper frequency cut-offs can be set using the configuration parameters LOFREQ and
HIFREQ. For example,

LOFREQ = 300
HIFREQ = 3400

might be used for processing telephone speech. When low and high pass cut-offs are set in this
way, the specified number of filterbank channels are distributed equally on the mel-scale across the
resulting pass-band such that the lower cut-off of the first filter is at LOFREQ and the upper cut-off
of the last filter is at HIFREQ.

If mel-scale filterbank parameters are required directly, then the target kind should be set to
MELSPEC. Alternatively, log filterbank parameters can be generated by setting the target kind to
FBANK.



5.5 Vocal Tract Length Normalisation 

A simple speaker normalisation technique can be implemented by modifying the filterbank analysis
described in the previous section. Vocal tract length normalisation (VTLN) aims to compensate for
the fact that speakers have vocal tracts of different sizes. VTLN can be implemented by warping
the frequency axis in the filterbank analysis. In HTK simple linear frequency warping is supported.
The warping factor $\alpha$ is controlled by the configuration variable WARPFREQ. Here values of $\alpha < 1.0$
correspond to a compression of the frequency axis. As the warping would lead to some filters
being placed outside the analysis frequency range, the simple linear warping function is modified
at the upper and lower boundaries. The result is that the lower boundary frequency of the analysis
(LOFREQ) and the upper boundary frequency (HIFREQ) are always mapped to themselves. The
regions in which the warping function deviates from the linear warping with factor $\alpha$ are controlled
with the two configuration variables WARPLCUTOFF and WARPUCUTOFF. Figure 5.4 shows the overall
shape of the resulting piece-wise linear warping functions.




Fig. 5.4 Frequency Warping



The warping factor $\alpha$ can for example be found using a search procedure that compares
likelihoods at different warping factors. A typical procedure would involve recognising an utterance
with $\alpha = 1.0$ and then performing forced alignment of the hypothesis for all warping factors in the
range 0.8 to 1.2. The factor that gives the highest likelihood is selected as the final warping factor.
Instead of estimating a separate warping factor for each utterance, larger units can be used by for
example estimating only one $\alpha$ per speaker.

Vocal tract length normalisation can be applied in testing as well as in training the acoustic
models.



5.6 Cepstral Features 

Most often, however, cepstral parameters are required and these are indicated by setting the target
kind to MFCC standing for Mel-Frequency Cepstral Coefficients (MFCCs). These are calculated from
the log filterbank amplitudes $\{m_j\}$ using the Discrete Cosine Transform

$$ c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\left( \frac{\pi i}{N} (j - 0.5) \right) \qquad (5.14) $$

where $N$ is the number of filterbank channels set by the configuration parameter NUMCHANS. The
required number of cepstral coefficients is set by NUMCEPS as in the linear prediction case. Liftering
can also be applied to MFCCs using the CEPLIFTER configuration parameter (see equation 5.12).
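The transform from log filterbank amplitudes to cepstra is a short double loop. The Python sketch below evaluates $c_i = \sqrt{2/N} \sum_j m_j \cos(\pi i (j - 0.5) / N)$ for $i = 1$ to the requested number of cepstra; the function name is illustrative and the code is not taken from HTK.

```python
import math

def mfcc_from_fbank(log_fbank, num_ceps):
    """Cepstra c_1..c_num_ceps from the N log filterbank amplitudes."""
    n_chans = len(log_fbank)
    scale = math.sqrt(2.0 / n_chans)
    return [scale * sum(m * math.cos(math.pi * i * (j - 0.5) / n_chans)
                        for j, m in enumerate(log_fbank, start=1))
            for i in range(1, num_ceps + 1)]

# A flat log spectrum carries no shape information, so c_1..c_4 vanish.
print(mfcc_from_fbank([3.0] * 8, 4))
```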

MFCCs are the parameterisation of choice for many speech recognition applications. They give
good discrimination and lend themselves to a number of manipulations. In particular, the effect
of inserting a transmission channel on the input speech is to multiply the speech spectrum by the
channel transfer function. In the log cepstral domain, this multiplication becomes a simple addition
which can be removed by subtracting the cepstral mean from all input vectors. In practice, of
course, the mean has to be estimated over a limited amount of speech data so the subtraction will
not be perfect. Nevertheless, this simple technique is very effective in practice where it compensates
for long-term spectral effects such as those caused by different microphones and audio channels. To
perform this so-called Cepstral Mean Normalisation (CMN) in HTK it is only necessary to add the
_Z qualifier to the target parameter kind. The mean is estimated by computing the average of each
cepstral parameter across each input speech file. Since this cannot be done with live audio, cepstral
mean compensation is not supported for this case.

In addition to the mean normalisation the variance of the data can be normalised. For improved
robustness both mean and variance of the data should be calculated on larger units (e.g. on all
the data from a speaker instead of just on a single utterance). To use speaker-/cluster-based
normalisation the mean and variance estimates are computed offline before the actual recognition
and stored in separate files (two files per cluster). The configuration variables CMEANDIR and
VARSCALEDIR point to the directories where these files are stored. To find the actual filename
a second set of variables (CMEANMASK and VARSCALEMASK) has to be specified. These masks are
regular expressions in which you can use the special characters ?, * and %. The appropriate mask
is matched against the filename of the file to be recognised and the substring that was matched
against the % characters is used as the filename of the normalisation file. An example config setting
is:

CMEANDIR = /data/eval01/plp/cmn
CMEANMASK = %%%%%%%%%%_*
VARSCALEDIR = /data/eval01/plp/cvn
VARSCALEMASK = %%%%%%%%%%_*
VARSCALEFN = /data/eval01/plp/globvar

So, if the file sw1-4930-B_4930Bx-sw1_000126_000439.plp is to be recognised then the
normalisation estimates would be loaded from the following files:

/data/eval01/plp/cmn/sw1-4930-B
/data/eval01/plp/cvn/sw1-4930-B
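The way a mask selects the normalisation file name can be imitated with an ordinary regular expression. The Python sketch below is an approximation written for illustration: it assumes that ? matches any single character, that * matches any run of characters, that % matches a single character which is kept, and that the mask is matched against the whole file name. HTK's own matcher may differ in detail, and the function name is invented here.

```python
import re

def match_mask(mask, filename):
    """Return the characters matched by '%' in the mask, or None if the
    mask does not match the file name."""
    pieces = []
    for ch in mask:
        if ch == "%":
            pieces.append("(.)")   # single character, captured
        elif ch == "?":
            pieces.append(".")     # single character, discarded
        elif ch == "*":
            pieces.append(".*")    # any run of characters
        else:
            pieces.append(re.escape(ch))
    found = re.fullmatch("".join(pieces), filename)
    return "".join(found.groups()) if found else None

print(match_mask("%%%%%%%%%%_*", "sw1-4930-B_4930Bx-sw1_000126_000439.plp"))
```

Ten % characters followed by an underscore therefore pick out the first ten characters of the file name, which here identify the conversation side.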

The file specified by VARSCALEFN contains the global target variance vector, i.e. the variance of
the data is first normalised to 1.0 based on the estimate in the appropriate file in VARSCALEDIR and
then scaled to the target variance given in VARSCALEFN.

The format of the files is very simple and each of them just contains one vector. Note that in
the case of the cepstral mean only the static coefficients will be normalised. A cmn file could for
example look like:

<CEPSNORM> <PLP_0>
<MEAN> 13
-10.285290 -9.484871 -6.454639 ...

The cepstral variance normalisation always applies to the full observation vector after all qualifiers
like delta and acceleration coefficients have been added, e.g.:

<CEPSNORM> <PLP_D_A_Z_0>
<VARIANCE> 39
33.543018 31.241779 36.076199 ...

The global variance vector will always have the same number of dimensions as the cvn vector,
e.g.:

<VARSCALE> 39
2.974308e+01 4.143743e+01 3.819999e+01 ...

These estimates can be generated using HCompV. See the reference section for details.
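Taken together, the three vectors act on an observation as sketched below in Python. This is an interpretation of the description rather than HTK's code: it assumes that the mean vector covers only the leading static coefficients, that the variance vectors cover the full observation, and that scaling to the target variance is a multiplication by the square root of the variance ratio. The function name is hypothetical.

```python
import math

def normalise(vectors, mean, variance, target_variance):
    """Apply mean subtraction to the first len(mean) components and
    variance scaling to every component of each observation vector."""
    num_static = len(mean)
    out = []
    for v in vectors:
        row = []
        for d, x in enumerate(v):
            if d < num_static:
                x -= mean[d]  # static coefficients only
            # Normalise to unit variance, then scale to the target variance.
            row.append(x * math.sqrt(target_variance[d] / variance[d]))
        out.append(row)
    return out

obs = [[1.0, 2.0], [3.0, 6.0]]
print(normalise(obs, mean=[2.0], variance=[1.0, 4.0], target_variance=[1.0, 1.0]))
```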

5.7 Perceptual Linear Prediction

An alternative to the Mel-Frequency Cepstral Coefficients is the use of Perceptual Linear Prediction
(PLP) coefficients.

As implemented in HTK the PLP feature extraction is based on the standard mel-frequency
filterbank (possibly warped). The mel filterbank coefficients are weighted by an equal-loudness
curve and then compressed by taking the cubic root^5. From the resulting auditory spectrum LP
coefficients are estimated which are then converted to cepstral coefficients in the normal way (see
above).

^5 The degree of compression can be controlled by setting the configuration parameter COMPRESSFACT which is the
power to which the amplitudes are raised and defaults to 0.33.

5.8 Energy Measures

To augment the spectral parameters derived from linear prediction or mel-filterbank analysis, an
energy term can be appended by including the qualifier _E in the target kind. The energy is
computed as the log of the signal energy, that is, for speech samples $\{s_n, n = 1, N\}$

$$ E = \log \sum_{n=1}^{N} s_n^2 \qquad (5.15) $$

This log energy measure can be normalised to the range $-E_{min}..1.0$ by setting the Boolean
configuration parameter ENORMALISE to true (default setting). This normalisation is implemented
by subtracting the maximum value of $E$ in the utterance and adding 1.0. Note that energy
normalisation is incompatible with live audio input and in such circumstances the configuration variable
ENORMALISE should be explicitly set false. The lowest energy in the utterance can be clamped using
the configuration parameter SILFLOOR which gives the ratio between the maximum and minimum
energies in the utterance in dB. Its default value is 50dB. Finally, the overall log energy can be
arbitrarily scaled by the value of the configuration parameter ESCALE whose default is 0.1.
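The interaction of ENORMALISE, SILFLOOR and ESCALE can be sketched as follows. The Python fragment below is an assumption-laden illustration, not HTK's code: it takes the logarithm to be natural, clamps to the floor before scaling, and applies the scale factor to the distance below the maximum so that the largest value maps to 1.0. The function names are invented for this sketch.

```python
import math

def log_energy(window):
    """Log of the summed squared samples (natural logarithm assumed).
    An all-zero window would need a floor before taking the log."""
    return math.log(sum(s * s for s in window))

def normalise_energies(energies, sil_floor_db=50.0, escale=0.1):
    """Normalise the per-frame log energies of one utterance."""
    e_max = max(energies)
    # SILFLOOR is an energy ratio in dB; convert it to a natural-log distance.
    e_min = e_max - sil_floor_db * math.log(10.0) / 10.0
    return [1.0 - (e_max - max(e, e_min)) * escale for e in energies]

print(normalise_energies([0.0, -3.0, -100.0]))
```

With the default values the loudest frame maps to 1.0 and any frame more than 50dB below it is clamped to the same lowest value.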

When calculating energy for LPC-derived parameterisations, the default is to use the zeroth
delay autocorrelation coefficient ($r_0$). However, this means that the energy is calculated after
windowing and pre-emphasis. If the configuration parameter RAWENERGY is set true, however, then
energy is calculated separately before any windowing or pre-emphasis regardless of the requested
parameterisation^6.

In addition to, or in place of, the log energy, the qualifier _0 can be added to a target kind to
indicate that the 0'th cepstral parameter $c_0$ is to be appended. This qualifier is only valid if the
target kind is MFCC. Unlike earlier versions of HTK scaling factors set by the configuration variable
ESCALE are not applied to $c_0$^7.

5.9 Delta, Acceleration and Differential Coefficients 

The performance of a speech recognition system can be greatly enhanced by adding time derivatives
to the basic static parameters. In HTK, these are indicated by attaching qualifiers to the basic
parameter kind. The qualifier _D indicates that first order regression coefficients (referred to as
delta coefficients) are appended, the qualifier _A indicates that second order regression coefficients
(referred to as acceleration coefficients) are appended and the qualifier _T indicates that third order
regression coefficients (referred to as third differential coefficients) are appended. The _A qualifier
cannot be used without also using the _D qualifier. Similarly the _T qualifier cannot be used without
also using the _D and _A qualifiers.

The delta coefficients are computed using the following regression formula

$$ d_t = \frac{\sum_{\theta=1}^{\Theta} \theta\, (c_{t+\theta} - c_{t-\theta})}{2 \sum_{\theta=1}^{\Theta} \theta^2} \qquad (5.16) $$

where $d_t$ is a delta coefficient at time $t$ computed in terms of the corresponding static coefficients
$c_{t-\Theta}$ to $c_{t+\Theta}$. The value of $\Theta$ is set using the configuration parameter DELTAWINDOW. The same
formula is applied to the delta coefficients to obtain acceleration coefficients except that in this
case the window size is set by ACCWINDOW. Similarly the third differentials use THIRDWINDOW. Since
equation 5.16 relies on past and future speech parameter values, some modification is needed at the
beginning and end of the speech. The default behaviour is to replicate the first or last vector as
needed to fill the regression window.
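The regression formula and the default end-effect handling can be written as a short Python sketch. It computes, for each frame, the weighted sum of differences between frames $\theta$ steps ahead and $\theta$ steps behind, divided by twice the sum of $\theta^2$, and it replicates the first or last vector beyond the ends of the data. The function name is illustrative and the window of 2 used as the default argument is only an example value.

```python
def delta_coefficients(frames, theta=2):
    """Regression deltas for a list of equal-length vectors.
    theta plays the role of DELTAWINDOW."""
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(th * th for th in range(1, theta + 1))

    def frame_at(t):
        # Replicate the first or last vector beyond the ends of the data.
        return frames[min(max(t, 0), num_frames - 1)]

    deltas = []
    for t in range(num_frames):
        acc = [0.0] * dim
        for th in range(1, theta + 1):
            ahead, behind = frame_at(t + th), frame_at(t - th)
            for i in range(dim):
                acc[i] += th * (ahead[i] - behind[i])
        deltas.append([x / denom for x in acc])
    return deltas

ramp = [[float(t)] for t in range(5)]
print(delta_coefficients(ramp, 2))
```

Applying the same function to its own output, with the window set by ACCWINDOW, would give acceleration coefficients in the same manner. On the ramp input the delta is 1.0 in the middle and shrinks towards the ends, which is the visible effect of replicating the boundary vectors.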

In version 1.5 of HTK and earlier, this end-effect problem was solved by using simple first
order differences at the start and end of the speech, that is

$$ d_t = c_{t+1} - c_t, \qquad t < \Theta \qquad (5.17) $$

and

$$ d_t = c_t - c_{t-1}, \qquad t \ge T - \Theta \qquad (5.18) $$

where $T$ is the length of the data file. If required, this older behaviour can be restored by setting
the configuration variable V1COMPAT to true in HParm.

^6 In any event, setting the compatibility variable V1COMPAT to true in HParm will ensure that the calculation of
energy is compatible with that computed by the Version 1 tool HCode.

^7 Unless V1COMPAT is set to true.

For some purposes, it is useful to use simple differences throughout. This can be achieved by
setting the configuration variable SIMPLEDIFFS to true in HParm. In this case, just the end-points
of the delta window are used, i.e.

$$ d_t = \frac{c_{t+\Theta} - c_{t-\Theta}}{2\Theta} \qquad (5.19) $$

When delta and acceleration coefficients are requested, they are computed for all static
parameters including energy if present. In some applications, the absolute energy is not useful but time
derivatives of the energy may be. By including the _E qualifier together with the _N qualifier, the
absolute energy is suppressed leaving just the delta and acceleration coefficients of the energy.



5.10 Storage of Parameter Files

Whereas HTK can handle waveform data in a variety of file formats, all parameterised speech data
is stored externally in either native HTK format data files or Entropic Esignal format files. Entropic
ESPS format is no longer supported directly, but input and output filters can be used to convert
ESPS to Esignal format on input and Esignal to ESPS on output.

5.10.1 HTK Format Parameter Files

HTK format files consist of a contiguous sequence of samples preceded by a header. Each sample
is a vector of either 2-byte integers or 4-byte floats. 2-byte integers are used for compressed forms
as described below and for vector quantised data as described later in section 5.14. HTK format
data files can also be used to store speech waveforms as described in section 5.11.
The HTK file format header is 12 bytes long and contains the following data

nSamples   - number of samples in file (4-byte integer)
sampPeriod - sample period in 100ns units (4-byte integer)
sampSize   - number of bytes per sample (2-byte integer)
parmKind   - a code indicating the sample kind (2-byte integer)
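The 12-byte header maps directly onto a fixed binary layout, so it can be read with Python's struct module. The sketch below is illustrative only: the function name is invented, the default of big-endian byte order is an assumption (the actual order depends on how the file was written and on the byte-order settings discussed in section 5.2), and parmKind is read as an unsigned value purely to make later bit tests convenient.

```python
import struct

# nSamples (4 bytes), sampPeriod (4 bytes), sampSize (2 bytes), parmKind (2 bytes)
HEADER_FORMAT = "iihH"

def parse_htk_header(data, byte_order=">"):
    """Decode the first 12 bytes of an HTK format file.
    byte_order is '>' for big-endian or '<' for little-endian."""
    if len(data) < 12:
        raise ValueError("an HTK header is 12 bytes long")
    n_samples, samp_period, samp_size, parm_kind = struct.unpack(
        byte_order + HEADER_FORMAT, data[:12])
    return {"nSamples": n_samples, "sampPeriod": samp_period,
            "sampSize": samp_size, "parmKind": parm_kind}

# Build a header by hand: 100 vectors, 10 msec period, 13 floats of 4 bytes.
raw = struct.pack(">iihH", 100, 100000, 52, 6)
print(parse_htk_header(raw))
```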

The parameter kind consists of a 6 bit code representing the basic parameter kind plus additional
bits for each of the possible qualifiers. The basic parameter codes are

0    WAVEFORM     sampled waveform
1    LPC          linear prediction filter coefficients
2    LPREFC       linear prediction reflection coefficients
3    LPCEPSTRA    LPC cepstral coefficients
4    LPDELCEP     LPC cepstra plus delta coefficients
5    IREFC        LPC reflection coef in 16 bit integer format
6    MFCC         mel-frequency cepstral coefficients
7    FBANK        log mel-filter bank channel outputs
8    MELSPEC      linear mel-filter bank channel outputs
9    USER         user defined sample kind
10   DISCRETE     vector quantised data
11   PLP          PLP cepstral coefficients



and the bit-encoding for the qualifiers (in octal) is

_E   000100   has energy
_N   000200   absolute energy suppressed
_D   000400   has delta coefficients
_A   001000   has acceleration coefficients
_C   002000   is compressed
_Z   004000   has zero mean static coef.
_K   010000   has CRC checksum
_0   020000   has 0'th cepstral coef.
_V   040000   has VQ data
_T   100000   has third differential coef.
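The two tables combine into a single integer code: the low 6 bits select the base kind and each qualifier sets one further bit. The Python sketch below encodes and decodes such codes. The function names are illustrative, and the order in which qualifiers appear in the decoded string simply follows the table rather than any ordering that HTK tools may print.

```python
BASE_KINDS = ["WAVEFORM", "LPC", "LPREFC", "LPCEPSTRA", "LPDELCEP", "IREFC",
              "MFCC", "FBANK", "MELSPEC", "USER", "DISCRETE", "PLP"]

# Qualifier bits, written in octal as in the table.
QUALIFIER_BITS = {"_E": 0o000100, "_N": 0o000200, "_D": 0o000400,
                  "_A": 0o001000, "_C": 0o002000, "_Z": 0o004000,
                  "_K": 0o010000, "_0": 0o020000, "_V": 0o040000,
                  "_T": 0o100000}

def decode_parm_kind(code):
    """Turn a parmKind code into a name such as MFCC_E_D_A."""
    base = code & 0o77  # the low 6 bits hold the base kind
    if base >= len(BASE_KINDS):
        raise ValueError("unknown base kind %d" % base)
    quals = [q for q, bit in QUALIFIER_BITS.items() if code & bit]
    return BASE_KINDS[base] + "".join(quals)

def encode_parm_kind(name):
    """Turn a name such as MFCC_E_D_A into its parmKind code."""
    base, *quals = name.split("_")
    code = BASE_KINDS.index(base)
    for q in quals:
        code |= QUALIFIER_BITS["_" + q]
    return code

print(oct(encode_parm_kind("MFCC_E_D_A")), decode_parm_kind(0o1506))
```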



The _A qualifier can only be specified when _D is also specified. The _N qualifier is only valid
when both energy and delta coefficients are present. The sample kind LPDELCEP is identical to
LPCEPSTRA_D and is retained for compatibility with older versions of HTK. The _C and _K qualifiers only exist
in external files. Compressed files are always decompressed on loading and any attached CRC is
checked and removed. An external file can contain both an energy term and a 0'th order cepstral
coefficient. These may be retained on loading but normally one or the other is discarded^8.

Fig. 5.5 Parameter Vector Layout in HTK Format Files
(examples shown: LPC, LPC_E, LPC_D, LPC_E_D, LPC_E_D_N, LPC_E_D_A; legend: C basic coefficients,
E log energy, dC and dE delta coefficients, DC and DE acceleration coefficients)

All parameterised forms of HTK data files consist of a sequence of vectors. Each vector is 
organised as shown by the examples in Fig 5.5 where various different qualified forms are listed. As 
can be seen, an energy value if present immediately folfew^he base coefficients. If delta coefficients 
are added, these follow the base coefficients and energy Vakje. Note that the base form LPC is used 
in this figure only as an example, the same layout applies wyAU base sample kinds. If the O'th order 
cepstral coefficient is included as well as energy then it is inserted immediately before the energy 
coefficient, otherwise it replaces it. 

For external storage of speech parameter files, two compressipt^ methods are provided. For LP 
coding only, the IREFC parameter kind exploits the fact that the^reflection coefficients are bounded 
by ±1 and hence they can be stored as scaled integers such tnm +1.0 is stored as 32767 and 
— 1.0 is stored as —32767. For other types of parameterisation, a nj^^general compression facility 
indicated by the _C qualifier is used. HTK compressed parameter files'^^sist of a set of compressed 
parameter vectors stored as shorts such that for parameter x 



    x_{short} = A \cdot x_{float} - B 

The coefficients A and B are defined as 

    A = \frac{2 I}{x_{max} - x_{min}} 

    B = \frac{(x_{max} + x_{min}) I}{x_{max} - x_{min}} 

where x_{max} is the maximum value of parameter x in the whole file and x_{min} is the corresponding 
minimum. I is the maximum range of a 2-byte integer i.e. 32767. The values of A and B are stored 
as two floating point vectors prepended to the start of the file immediately after the header. 
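
The scaling described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the HTK implementation: the function names are invented for the example, vectors are plain Python lists, and the handling of a component that never varies (where the range would be zero) is an assumption made to keep the sketch safe to run.

```python
I = 32767  # maximum magnitude of a 2-byte signed integer


def compression_coefficients(vectors):
    """Return per-component lists (A, B) computed over a whole file of vectors."""
    n_comps = len(vectors[0])
    a, b = [], []
    for k in range(n_comps):
        column = [v[k] for v in vectors]
        x_max, x_min = max(column), min(column)
        span = x_max - x_min
        if span == 0.0:
            span = 1.0  # assumption: constant component, avoid dividing by zero
        a.append(2.0 * I / span)
        b.append((x_max + x_min) * I / span)
    return a, b


def compress(vectors, a, b):
    """Map each float to a short using x_short = A * x_float - B."""
    return [[int(round(a[k] * x - b[k])) for k, x in enumerate(v)] for v in vectors]


def decompress(shorts, a, b):
    """Invert the mapping: x_float = (x_short + B) / A."""
    return [[(s + b[k]) / a[k] for k, s in enumerate(v)] for v in shorts]
```

With this mapping the largest value of each component in the file is stored as 32767 and the smallest as -32767, so the full range of the short is used for every component.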

When a HTK tool writes out a speech file to external storage, no further signal conversions are 
performed. Thus, for most purposes, the target parameter kind specifies both the required internal 
representation and the form of the written output, if any. However, there is a distinction in the 
way that the external data is actually stored. Firstly, it can be compressed as described above by 
setting the configuration parameter SAVECOMPRESSED to true. If the target kind is LPREFC then this 
compression is implemented by converting to IREFC otherwise the general compression algorithm 
described above is used. Secondly, in order to avoid data corruption problems, externally stored 
HTK parameter files can have a cyclic redundancy checksum appended. This is indicated by the 
qualifier _K and it is generated by setting the configuration parameter SAVEWITHCRC to true. The 
principal tool which uses these output conversions is HCopy (see section 5.16). 

* Some applications may require the 0'th order cepstral coefficient in order to recover the filterbank coefficients 
from the cepstral coefficients. 



5.10.2 Esignal Format Parameter Files 

The default for parameter files is native HTK format. However, HTK tools also support the Entropic 
Esignal format for both input and output. Esignal replaces the Entropic ESPS file format. To ensure 
compatibility Entropic provides conversion programs from ESPS to ESIG and vice versa. 

To indicate that a source file is in Esignal format the configuration variable SOURCEFORMAT 
should be set to ESIG. Alternatively, -F ESIG can be specified as a command-line option. To 
generate Esignal format output files, the configuration variable TARGETFORMAT should be set to 
ESIG or the command line option -O ESIG should be set. 

ESIG files consist of three parts: a preamble, a sequence of field specifications called the field 
list and a sequence of records. The preamble and the field list together constitute the header. The 
preamble is purely ASCII. Currently it consists of 6 information items that are all terminated by a 
new line. The information in the preamble is the following: 

line 1 - identification of the file format 

line 2 - version of the file format 

line 3 - architecture (ASCII, EDR1, EDR2, machine name) 

line 4 - preamble size (48 bytes) 

line 5 - total header size 

line 6 - record size 

All ESIG files that are output by HTK programs contain the following global fields: 

commandLine - the command-line used to generate the file; 

recordFreq - a double value that indicates the sample frequency in Hertz; 

startTime - a double value that indicates a time at which the first sample is presumed to be starting; 

parmKind - a character string that indicates the full type of parameters in the file, e.g: MFCC_E_D. 

source_1 - if the input file was an ESIG file this field includes the header items in the input file. 

After that there are field specifiers for the records. The first specifier is for the basekind of the 
parameters, e.g: MFCC. Then for each available qualifier there are additional specifiers. Possible 
specifiers are: 

zeroc 
energy 
delta 
delta_zeroc 
delta_energy 
accs 
accs_zeroc 
accs_energy 



The data segments of the ESIG files have exactly the same format as the corresponding HTK 
files. This format was described in the previous section. 

HTK can only input parameter files that have a valid parameter kind as value of the header field 
parmKind. If this field does not exist or if the value of this field does not contain a valid parameter 
kind, the file is rejected. After the header has been read the file is treated as an HTK file. 

5.11 Waveform File Formats 

For reading waveform data files, HTK can support a variety of different formats and these are all 
briefly described in this section. The default speech file format is HTK. If a different format is to 
be used, it can be specified by setting the configuration parameter SOURCEFORMAT. However, since 
file formats need to be changed often, they can also be set individually via the -F command-line 
option. This over-rides any setting of the SOURCEFORMAT configuration parameter. 

Similarly for the output of waveforms, the format can be set using either the configuration 
parameter TARGETFORMAT or the -O command-line option. However, for output only native HTK 
format (HTK), Esignal format (ESIG) and headerless (NOHEAD) waveform files are supported. 

The following sub-sections give a brief description of each of the waveform file formats supported 
by HTK. 

5.11.1 HTK File Format 

The HTK file format for waveforms is identical to that described in section 5.10 above. It consists 
of a 12 byte header followed by a sequence of 2 byte integer speech samples. For waveforms, the 
sampSize field will be 2 and the parmKind field will be 0. The sampPeriod field gives the sample 
period in 100ns units, hence for example, it will have the value 1000 for speech files sampled at 
10kHz and 625 for speech files sampled at 16kHz. 
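
As a rough illustration of these header values, the following Python sketch packs and unpacks a 12 byte waveform header. It assumes the field order nSamples, sampPeriod, sampSize, parmKind given in section 5.10 and big-endian byte order; the function names are invented for the example and the byte order in particular should be checked against the files actually in use.

```python
import struct

# Assumed layout: nSamples (4 bytes), sampPeriod (4 bytes),
# sampSize (2 bytes), parmKind (2 bytes), big-endian.
HTK_HEADER = struct.Struct(">iihh")


def sample_period_100ns(sample_rate_hz):
    """Sample period in 100ns units, e.g. 16000 Hz gives 625."""
    return round(1e7 / sample_rate_hz)


def pack_waveform_header(n_samples, sample_rate_hz):
    """Build a header for a waveform file: sampSize is 2 and parmKind is 0."""
    return HTK_HEADER.pack(n_samples, sample_period_100ns(sample_rate_hz), 2, 0)


def unpack_header(raw):
    """Decode the first 12 bytes of an HTK format file."""
    n_samples, samp_period, samp_size, parm_kind = HTK_HEADER.unpack(raw[:12])
    return {"nSamples": n_samples, "sampPeriod": samp_period,
            "sampSize": samp_size, "parmKind": parm_kind}
```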

5.11.2 Esignal File Format 

The Esignal file format for waveforms is similar to that described in section 5.10 above with the 
following exceptions. When reading an ESIG waveform file the HTK programs only check whether 
the record length equals 2 and whether the datatype of the only field in the data records is SHORT. 
The data field that is created on output of a waveform is called WAVEFORM. 

5.11.3 TIMIT File Format 

The TIMIT format has the same structure as the HTK format except that the 12-byte header 
contains the following 

hdrSize - number of bytes in header ie 12 (2-byte integer) 

version - version number (2-byte integer) 

numChannels - number of channels (2-byte integer) 

sampRate - sample rate (2-byte integer) 

nSamples - number of samples in file (4-byte integer) 

TIMIT format data is used only on the prototype TIMIT CD ROM. 

5.11.4 NIST File Format 

The NIST file format is also referred to as the Sphere file format. A NIST header consists of ASCII 
text. It begins with a label of the form NISTxx where xx is a version code followed by the number 
of bytes in the header. The remainder of the header consists of name value pairs of which HTK 
decodes the following 

sample_rate - sample rate in Hz 

sample_n_bytes - number of bytes in each sample 

sample_count - number of samples in file 

sample_byte_format - byte order 

sample_coding - speech coding eg pcm, μlaw, shortpack 

channels_interleaved - for 2 channel data only 

The current NIST Sphere data format subsumes a variety of internal data organisations. HTK 
currently supports interleaved μlaw used in Switchboard, Shortpack compression used in the original 
version of WSJ0 and standard 16bit linear PCM as used in Resource Management, TIMIT, etc. 
It does not currently support the Shorten compression format as used in WSJ1 due to licensing 
restrictions. Hence, to read WSJ1, the files must be converted using the NIST supplied 
decompression routines into standard 16 bit linear PCM. This is most conveniently done under UNIX by 
using the decompression program as an input filter set via the environment variable HWAVEFILTER 
(see section 4.8). 

For interleaved μlaw as used in Switchboard, the default is to add the two channels together. 
The left channel only can be obtained by setting the environment variable STEREOMODE to LEFT and 
the right channel only can be obtained by setting the environment variable STEREOMODE to RIGHT. 
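
To make the header structure concrete, here is a small Python sketch that reads the name value pairs from a NIST header. It is not the HTK routine. It assumes the common Sphere conventions that each field line has the form "name -type value", that the type codes are -i for integers, -r for reals and -sN for strings, and that the field list ends with a line reading end_head; the function name is invented for the example.

```python
def parse_nist_header(raw):
    """Return (label, header_size, fields) from the start of a NIST file."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    label = lines[0].strip()             # e.g. NIST_1A
    header_size = int(lines[1].strip())  # number of bytes in the header
    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":           # assumed terminator of the field list
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue                     # skip blank or malformed lines
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:                            # -sN string fields
            fields[name] = value
    return label, header_size, fields
```

A reader would then use sample_rate, sample_n_bytes, sample_count, sample_byte_format and sample_coding from the returned dictionary to decide how to decode the samples that follow the header.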



5.11.5 SCRIBE File Format 

The SCRIBE format is a subset of the standard laid down by the European Esprit Programme SAM 
Project. SCRIBE data files are headerless and therefore consist of just a sequence of 16 bit sample 
values. HTK assumes by default that the sample rate is 20kHz. The configuration parameter 
SOURCERATE should be set to over-ride this. The byte ordering assumed for SCRIBE data files is 
VAX (little-endian). 

5.11.6 SDES1 File Format 

The SDES1 format refers to the "Sound Designer I" format defined by Digidesign Inc in 1985 for 
multimedia and general audio applications. It is used for storing short monaural sound samples. 
The SDES1 header is complex (1336 bytes) since it allows for associated display window information 
to be stored in it as well as providing facilities for specifying repeat loops. The HTK input routine 
for this format just picks out the following information 

headerSize - size of header ie 1336 (2 byte integer) 
(182 byte filler) 

fileSize - number of bytes of sampled data (4 byte integer) 
(832 byte filler) 

sampRate - sample rate in Hz (4 byte integer) 

sampPeriod - sample period in microseconds (4 byte integer) 

sampSize - number of bits per sample ie 16 (2 byte integer) 

5.11.7 AIFF File Format 



The AIFF format was defined by Apple Computer for storing monaural and multichannel sampled 
sounds. An AIFF file consists of a number of chunks. A Common chunk contains the fundamental 
parameters of the sound (sample rate, number of channels, etc) and a Sound Data chunk contains 
sampled audio data. HTK only partially supports AIFF since some of the information in it is 
stored as floating point numbers. In particular, the sample rate is stored in this form and to 
avoid portability problems, HTK ignores the given sample rate and assumes that it is 16kHz. 
If this default rate is incorrect, then the true sample period should be specified by setting the 
SOURCERATE configuration parameter. Full details of the AIFF format are available from Apple 
Developer Technical Support. 

5.11.8 SUNAU8 File Format 

The SUNAU8 format defines a subset of the ".au" and ".snd" audio file format used by Sun and 
NeXT. An SUNAU8 speech data file consists of a header followed by 8 bit μlaw encoded speech 
samples. The header is 28 bytes and contains the following fields, each of which is 4 bytes 

magicNumber - magic number 0x2e736e64 

dataLocation - offset to start of data 

dataSize - number of bytes of data 

dataFormat - data format code which is 1 for 8 bit μlaw 

sampRate - a sample rate code which is always 8012.821 Hz 

numChan - the number of channels 

info - arbitrary character string min length 4 bytes 

No default byte ordering is assumed for this format. If the data source is known to be different to 
the machine being used, then the environment variable BYTEORDER must be set appropriately. Note 
that when used on Sun Sparc machines with 16 bit audio device the sampling rate of 8012.821Hz 
is not supported and playback will be performed at 8kHz. 

5.11.9 OGI File Format 

The OGI format is similar to TIMIT. The header contains the following 

hdrSize - number of bytes in header 

version - version number (2-byte integer) 

numChannels - number of channels (2-byte integer) 

sampRate - sample rate (2-byte integer) 

nSamples - number of samples in file (4-byte integer) 

lendian - used to test for byte swapping (4-byte integer) 



5.11.10 WAV File Format 

The WAV file format is a subset of Microsoft's RIFF specification for the storage of multimedia 
files. A RIFF file starts out with a file header followed by a sequence of data "chunks". A WAV file 
is often just a RIFF file with a single "WAVE" chunk which consists of two sub-chunks - a "fmt" 
chunk specifying the data format and a "data" chunk containing the actual sample data. The WAV 
file header contains the following 

'RIFF' - RIFF file identification (4 bytes) 

<length> - length field (4 bytes) 

'WAVE' - WAVE chunk identification (4 bytes) 

'fmt ' - format sub-chunk identification (4 bytes) 

flength - length of format sub-chunk (4 byte integer) 

format - format specifier (2 byte integer) 

chans - number of channels (2 byte integer) 

sampsRate - sample rate in Hz (4 byte integer) 

bpsec - bytes per second (4 byte integer) 

bpsample - bytes per sample (2 byte integer) 

bpchan - bits per channel (2 byte integer) 

'data' - data sub-chunk identification (4 bytes) 

dlength - length of data sub-chunk (4 byte integer) 

Support is provided for 8-bit CCITT mu-law, 8-bit CCITT a-law, 8-bit PCM linear and 16-bit 
PCM linear - all in stereo or mono (use of STEREOMODE parameter as per NIST). The default byte 
ordering assumed for WAV data files is VAX (little-endian). 
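
The header layout listed above maps directly onto a fixed 44 byte structure, which the following Python sketch decodes. It assumes the simplest case only: the format sub-chunk is exactly 16 bytes long and the data sub-chunk follows it immediately, with little-endian byte order. The function name and the extra sampPeriod entry in the result are additions made for the example.

```python
import struct

# 'RIFF', length, 'WAVE', 'fmt ', flength, format, chans, sampsRate,
# bpsec, bpsample, bpchan, 'data', dlength (little-endian, no padding)
WAV_HEADER = struct.Struct("<4sI4s4sIHHIIHH4sI")


def parse_wav_header(raw):
    """Decode a canonical 44 byte WAV header into a dictionary."""
    (riff, length, wave, fmt_id, flength, fmt, chans, samps_rate,
     bpsec, bpsample, bpchan, data_id, dlength) = WAV_HEADER.unpack(raw[:WAV_HEADER.size])
    if (riff, wave, fmt_id, data_id) != (b"RIFF", b"WAVE", b"fmt ", b"data"):
        raise ValueError("not a canonical 44 byte WAV header")
    return {
        "length": length, "flength": flength, "format": fmt, "chans": chans,
        "sampsRate": samps_rate, "bpsec": bpsec, "bpsample": bpsample,
        "bpchan": bpchan, "dlength": dlength,
        # sample period in 100ns units, as used elsewhere in this chapter
        "sampPeriod": round(1e7 / samps_rate),
    }
```

Real WAV files may carry additional chunks or a longer format sub-chunk, in which case a reader has to walk the chunk list instead of relying on fixed offsets.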

5.11.11 ALIEN and NOHEAD File Formats 

HTK tools can read speech waveform files with alien formats provided that their overall structure 
is that of a header followed by data. This is done by setting the format to ALIEN and setting the 
environment variable HEADERSIZE to the number of bytes in the header. HTK will then attempt 
to infer the rest of the information it needs. However, if input is from a pipe, then the number 
of samples expected must be set using the environment variable NSAMPLES. The sample rate of the 
source file is defined by the configuration parameter SOURCERATE as described in section 5.2. If 
the file has no header then the format NOHEAD may be specified instead of ALIEN in which case 
HEADERSIZE is assumed to be zero. 

5.12 Direct Audio Input/Output 

Many HTK tools, particularly recognition tools, can input speech waveform data directly from an 
audio device. The basic mechanism for doing this is to simply specify the SOURCEKIND as being 
HAUDIO following which speech samples will be read directly from the host computer's audio input 
device. 

Note that for live audio input, the configuration variable ENORMALISE should be set to false both 
during training and recognition. Energy normalisation cannot be used with live audio input, and 
the default setting for this variable is TRUE. When training models for live audio input, be sure to 
set ENORMALISE to false. If you have existing models trained with ENORMALISE set to true, you can 
retrain them using single-pass retraining (see section 8.6). 

When using direct audio input, the input sampling rate may be set explicitly using the 
configuration parameter SOURCERATE, otherwise HTK will assume that it has been set by some external 
means such as an audio control panel. In the latter case, it must be possible for HAudio to obtain 
the sample rate from the audio driver otherwise an error message will be generated. 

Although the detailed control of audio hardware is typically machine dependent, HTK provides 
a number of Boolean configuration variables to request specific input and output sources. These 
are indicated by the following table 





Variable      Source/Sink 
LINEIN        line input 
MICIN         microphone input 
LINEOUT       line output 
PHONESOUT     headphones output 
SPEAKEROUT    speaker output 



The major complication in using direct audio is in starting and stopping the input device. The 
simplest approach to this is for HTK tools to take direct control and, for example, enable the audio 
input for a fixed period determined via a command line option. However, the HAudio/HParm 
modules provide two more powerful built-in facilities for audio input control. 

The first method of audio input control involves the use of an automatic energy-based speech/silence 
detector which is enabled by setting the configuration parameter USESILDET to true. Note that the 
speech/silence detector can also operate on waveform input files. 

The automatic speech/silence detector uses a two level algorithm which first classifies each frame 
of data as either speech or silence and then applies a heuristic to determine the start and end of 
each utterance. The detector classifies each frame as speech or silence based solely on the log 
energy of the signal. When the energy value exceeds a threshold the frame is marked as speech 
otherwise as silence. The threshold is made up of two components both of which can be set by 
configuration variables. The first component represents the mean energy level of silence and can 
be set explicitly via the configuration parameter SILENERGY. However, it is more usual to take a 
measurement from the environment directly. Setting the configuration parameter MEASURESIL to 
true will cause the detector to calibrate its parameters from the current acoustic environment just 
prior to sampling. The second threshold component is the level above which frames are classified as 
speech (SPEECHTHRESH). Once each frame has been classified as speech or silence they are grouped 
into windows consisting of SPCSEQCOUNT consecutive frames. When the number of frames marked 
as silence within each window falls below a glitch count the whole window is classed as speech. Two 
separate glitch counts are used, SPCGLCHCOUNT before speech onset is detected and SILGLCHCOUNT 
whilst searching for the end of the utterance. This allows the algorithm to take account of the 
tendency for the end of an utterance to be somewhat quieter than the beginning. Finally, a top 
level heuristic is used to determine the start and end of the utterance. The heuristic defines the 
start of speech as the beginning of the first window classified as speech. The actual start of the 
processed utterance is SILMARGIN frames before the detected start of speech to ensure that when 
the speech detector triggers slightly late the recognition accuracy is not affected. Once the start 
of the utterance has been found the detector searches for SILSEQCOUNT windows all classified as 
silence and sets the end of speech to be the end of the last window classified as speech. Once again 
the processed utterance is extended SILMARGIN frames to ensure that if the silence detector has 
triggered slightly early the whole of the speech is still available for further processing. 

[Figure: a signal shown together with its frame energy, the window classification and the resulting 
processed utterance; SILSEQCOUNT is marked on the window classification.] 

Fig. 5.6 Endpointer Parameters 



Fig 5.6 shows an example of the speech/silence detection process. The waveform data is first 
classified as speech or silence at frame and then at window level before finally the start and end of 
the utterance are marked. In the example, audio input starts at point A and is stopped automatically 
at point H. The start of speech, C, occurs when a window of SPCSEQCOUNT frames are classified as 
speech and the start of the utterance occurs SILMARGIN frames earlier at B. The period of silence 
from D to E is not marked as the end of the utterance because it is shorter than SILSEQCOUNT. 
However after point F no more windows are classified as speech (although a few frames are) and so 
this is marked as the end of speech with the end of the utterance extended to G. 
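
The two level procedure can be summarised in a short Python sketch. It is a simplified reading of the description above and not the HAudio/HParm code. In particular it assumes that the frame threshold is the sum of the silence level and SPEECHTHRESH, that windows are taken as consecutive non-overlapping blocks of SPCSEQCOUNT frames, and that a window counts as speech when its number of silence frames is strictly below the relevant glitch count. The function and argument names are invented for the example.

```python
def detect_utterance(frame_energy, sil_energy, speech_thresh, spc_seq_count,
                     spc_glch_count, sil_glch_count, sil_seq_count, sil_margin):
    """Return (start, end) frame indices of the processed utterance, or None.

    The end index is exclusive. All arguments mirror the configuration
    parameters named in the text.
    """
    # Level 1: classify every frame on log energy alone.
    threshold = sil_energy + speech_thresh
    is_speech = [e > threshold for e in frame_energy]

    # Level 2: classify windows and apply the start/end heuristic.
    start_win = None         # first window classified as speech
    last_speech_win = None   # most recent window classified as speech
    silent_run = 0           # consecutive silence windows after onset
    for w in range(len(is_speech) // spc_seq_count):
        frames = is_speech[w * spc_seq_count:(w + 1) * spc_seq_count]
        n_silence = frames.count(False)
        glitch = spc_glch_count if start_win is None else sil_glch_count
        window_is_speech = n_silence < glitch
        if start_win is None:
            if window_is_speech:
                start_win = last_speech_win = w
        elif window_is_speech:
            last_speech_win = w
            silent_run = 0
        else:
            silent_run += 1
            if silent_run >= sil_seq_count:
                break
    if start_win is None:
        return None

    # Extend both ends by the silence margin, clipped to the available data.
    start = max(0, start_win * spc_seq_count - sil_margin)
    end = min(len(frame_energy), (last_speech_win + 1) * spc_seq_count + sil_margin)
    return start, end
```

In this sketch a short pause inside an utterance, like the period from D to E in Fig 5.6, only resets the silence counter once speech resumes, so the utterance is not cut at that point.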

The second built-in mechanism for controlling audio input is by arranging for a signal to be sent 
from some other process. Sending the signal for the first time starts the audio device. If the speech 
detector is not enabled then sampling starts immediately and is stopped by sending the signal 
a second time. If automatic speech/silence detection is enabled, then the first signal starts the 
detector. Sampling stops immediately when a second signal is received or when silence is detected. 
The signal number is set using the configuration parameter AUDIOSIG. Keypress control operates 
in a similar fashion and is enabled by setting the configuration parameter AUDIOSIG to a negative 
number. In this mode an initial keypress will be required to start sampling/speech detection and a 
second keypress will stop sampling immediately. 

Audio output is also supported by HTK. There are no generic facilities for output and the precise 
behaviour will depend on the tool used. It should be noted, however, that the audio input facilities 
provided by HAudio include provision for attaching a replay buffer to an audio input channel. This 
is typically used to store the last few seconds of each input to a recognition tool in a circular buffer 
so that the last utterance input can be replayed on demand. 



5.13 Multiple Input Streams 

As noted in section 5.1, HTK tools regard the input observation sequence as divided into a 
number of independent data streams. For building continuous density HMM systems, this facility 
is of limited use and by far the most common case is that of a single data stream. However, when 
building tied-mixture systems or when using vector quantisation, a more uniform coverage of the 
acoustic space is obtained by separating energy, deltas, etc., into separate streams. 

This separation of parameter vectors into streams takes place at the point where the vectors 
are extracted from the converted input file or audio device and transformed into an observation. 
The tools for HMM construction and for recognition thus view the input data as a sequence of 
observations but note that this is entirely internal to HTK. Externally data is always stored as a 
single sequence of parameter vectors. 

When multiple streams are required, the division of the parameter vectors is performed auto- 
matically based on the parameter kind. This works according to the following rules. 



1 stream single parameter vector. This is the default case. 

2 streams if the parameter vector contains energy terms, then they are extracted and placed in 
stream 2. Stream 1 contains the remaining static coefficients and their deltas and 
accelerations, if any. Otherwise, the parameter vector must have appended delta coefficients and no 
appended acceleration coefficients. The vector is then split so that the static coefficients form 
stream 1 and the corresponding delta coefficients form stream 2. 

3 streams if the parameter vector has acceleration coefficients, then the vector is split with static 
coefficients plus any energy in stream 1, delta coefficients plus any delta energy in stream 2 and 
acceleration coefficients plus any acceleration energy in stream 3. Otherwise, the parameter 
vector must include log energy and must have appended delta coefficients. The vector is then 
split into three parts so that the static coefficients form stream 1, the delta coefficients form 
stream 2, and the log energy and delta log energy are combined to form stream 3. 

4 streams the parameter vector must include log energy and must have appended delta and 
acceleration coefficients. The vector is split into 4 parts so that the static coefficients form stream 
1, the delta coefficients form stream 2, the acceleration coefficients form stream 3 and the log 
energy, delta energy and acceleration energy are combined to form stream 4. 

In all cases, the static log energy may be suppressed (via the _N qualifier). If none of the above rules 
apply for some required number of streams, then the parameter vector is simply incompatible with 
that form of observation. For example, the parameter kind LPC_D_A cannot be split into 2 streams, 
instead 3 streams should be used. 
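
The rules above can be expressed as a small function that returns the width of each stream. This is a sketch for illustration only: the function name and arguments are invented, the 0'th order cepstral coefficient is not modelled, and it is assumed that the _N qualifier is only meaningful when both energy and delta coefficients are present.

```python
def stream_widths(n_static, energy, delta, accel, n_streams,
                  suppress_static_energy=False):
    """Return the list of stream widths, or raise ValueError if incompatible."""
    if accel and not delta:
        raise ValueError("acceleration coefficients require delta coefficients")
    if suppress_static_energy and not (energy and delta):
        raise ValueError("_N assumed to require energy and delta coefficients")
    e_static = 1 if energy and not suppress_static_energy else 0
    e_delta = 1 if energy and delta else 0
    e_accel = 1 if energy and accel else 0
    d = n_static if delta else 0
    a = n_static if accel else 0

    if n_streams == 1:
        return [n_static + e_static + d + e_delta + a + e_accel]
    if n_streams == 2:
        if energy:  # all energy terms go to stream 2
            return [n_static + d + a, e_static + e_delta + e_accel]
        if delta and not accel:
            return [n_static, d]
    elif n_streams == 3:
        if accel:
            return [n_static + e_static, d + e_delta, a + e_accel]
        if energy and delta:
            return [n_static, d, e_static + e_delta]
    elif n_streams == 4:
        if energy and delta and accel:
            return [n_static, d, a, e_static + e_delta + e_accel]
    raise ValueError("parameter kind cannot be split into %d streams" % n_streams)
```

For instance, a kind with 12 static coefficients and the qualifiers _E_D_A splits into four streams of widths 12, 12, 12 and 3, while a kind with only _D_A raises an error when 2 streams are requested, matching the LPC_D_A case mentioned above.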

Fig. 5.7 Example Stream Construction 

Fig. 5.7 illustrates the way that streams are constructed for a number of common cases. As 
earlier, the choice of LPC as the static coefficients is purely for illustration and the same mechanism 
applies to all base parameter kinds. 

As discussed further in the next section, multiple data streams are often used with vector 
quantised data. In this case, each VQ symbol per input sample is placed in a separate data stream. 



5.14 Vector Quantisation 



Although HTK was designed primarily for building continuous density HMM systems, it also supports 
discrete density HMMs. Discrete HMMs are particularly useful for modelling data which is 
naturally symbolic. They can also be used with continuous signals such as speech by quantising 
each speech vector to give a unique VQ symbol for each input frame. The HTK module HVQ 
provides a basic facility for performing this vector quantisation. The VQ table (or codebook) can 
be constructed using the HTK tool HQuant. 

When used with speech, the principal justification for using discrete HMMs is the much reduced 
computation. However, the use of vector quantisation introduces errors and it can lead to rather 
fragile systems. For this reason, the use of continuous density systems is generally preferred. To 
facilitate the use of continuous density systems when there are computational constraints, HTK also 
allows VQ to be used as the basis for pre-selecting a subset of Gaussian components for evaluation 
at each time frame. 

Fig. 5.8 Using Vector Quantisation 

Fig. 5.8 illustrates the different ways that VQ can be used in HTK for a single data stream. For 
multiple streams, the same principles are applied to each stream individually. A converted speech 
waveform or file of parameter vectors can have VQ indices attached simply by specifying the name 
of a VQ table using the configuration parameter VQTABLE and by adding the _V qualifier to the 
target kind. The effect of this is that each observation passed to a recogniser can include both a 
conventional parameter vector and a VQ index. For continuous density HMM systems, a possible 
use of this might be to preselect Gaussians for evaluation (but note that HTK does not currently 
support this facility). 

When used with a discrete HMM system, the continuous parameter vectors are ignored and 
only the VQ indices are used. For training and evaluating discrete HMMs, it is convenient to store 
speech data in vector quantised form. This is done using the tool HCopy to read in and vector 
quantise each speech file. Normally, HCopy copies the target form directly into the output file. 
However, if the configuration parameter SAVEASVQ is set, then it will store only the VQ indices and 
mark the kind of the newly created file as DISCRETE. Discrete files created in this way can be read 
directly by HParm and the VQ symbols passed directly to a tool as indicated by the lower part of 
Fig. 5.8. 

HVQ supports three types of distance metric and two organisations of VQ codebook. Each 
codebook consists of a collection of nodes where each node has a mean vector and optionally a 
covariance matrix or diagonal variance vector. The corresponding distance metric used for each of 
these is simple Euclidean, full covariance Mahalanobis or diagonal covariance Mahalanobis. The 
codebook nodes are arranged in the form of a simple linear table or as a binary tree. In the linear 
case, the input vector is compared with every node in turn and the nearest determines the VQ 
index. In the binary tree case, each non-terminal node has a left and a right daughter. Starting 
with the top-most root node, the input is compared with the left and right daughter node and the 
nearest is selected. This process is repeated until a terminal node is reached. 
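
The two codebook organisations can be sketched as follows. This is an illustration under simplifying assumptions, not the HVQ module: only the Euclidean metric is shown, nodes are plain dictionaries whose keys follow the node fields used in the VQ table description (vqidx, leftid, rightid, mean), a non-terminal node is recognised by a VQ index of 0, and ties go to the left daughter.

```python
import math


def euclidean(x, mean):
    """Simple Euclidean distance between an input vector and a node mean."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mean)))


def vq_linear(x, nodes):
    """Linear codebook: compare with every node and return the nearest VQ index."""
    best = min(nodes, key=lambda node: euclidean(x, node["mean"]))
    return best["vqidx"]


def vq_tree(x, nodes_by_id, root_id=1):
    """Binary tree codebook: descend to the nearer daughter until a terminal node."""
    node = nodes_by_id[root_id]
    while node["vqidx"] == 0:  # 0 marks a non-terminal node
        left = nodes_by_id[node["leftid"]]
        right = nodes_by_id[node["rightid"]]
        if euclidean(x, left["mean"]) <= euclidean(x, right["mean"]):
            node = left
        else:
            node = right
    return node["vqidx"]
```

The tree search needs only two distance computations per level instead of one per codebook entry, at the cost of possibly missing the globally nearest entry.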

VQ Tables are stored externally in text files consisting of a header followed by a sequence of 
node entries. The header consists of the following information 

magic        - magic number usually the original parameter kind 
type         - 0 = linear tree, 1 = binary tree 
mode         - diagonal covariance Mahalanobis, full covariance Mahalanobis or Euclidean 
numNodes     - total number of nodes in the codebook 
numS         - number of independent data streams 
sw1,sw2,...  - width of each data stream 

Every node has a unique integer identifier and consists of the following 

stream   - stream number for this node 
vqidx    - VQ index for this node (0 if non-terminal) 
nodeId   - integer id of this node 
leftId   - integer id of left daughter node 
rightId  - integer id of right daughter node 
mean     - mean vector 
cov      - diagonal variance or full covariance 

The inclusion of the optional variance vector or covariance matrix depends on the mode in the 
header. If present they are stored in inverse form. In a binary tree, the root id is always 1. In linear 
codebooks, the left and right daughter node id's are ignored. 

5.15 Viewing Speech with HList 

As mentioned in section 5.1, the tool HList provides a dual role in HTK. Firstly, it can be used for 
examining the contents of speech data files. In general, HList displays three types of information 

1. source header: requested using the -h option 

2. target header: requested using the -t option 

3. target data: printed by default. The begin and end samples of the displayed data can be 
specified using the -s and -e options. 

When the default configuration parameters are used, no conversions are applied and the target data 
is identical to the contents of the file. 

As an example, suppose that the file called timit.wav holds speech waveform data using the 
TIMIT format. The command 

    HList -h -e 49 -F TIMIT timit.wav 



would display the source header information and the first 50 samples of the file. The output would 
look something like the following 



    ----------------------------- Source: timit.wav ----------------------------- 
      Sample Bytes:  2          Sample Kind:   WAVEFORM 
      Num Comps:     1          Sample Period: 62.5 us 
      Num Samples:   31437      File Format:   TIMIT 
    ------------------------------ Samples: 0->49 ------------------------------- 
     0:    8   -4   -1    0   -2   -1  ... 
    10:   -1    0   -1   -2   -1    1    0   -1   -2    1 
    20:   -2    0    0    0    2    1   -2    2    1    0 
    30:    1    0    0   -1    4    2    0   -1    4    0 
    40:    2    2    1   -1   -1    1    1    2    1    1 
    ----------------------------------- END ------------------------------------- 



The source information confirms that the file contains WAVEFORM data with 2 byte samples and 
31437 samples in total. The sample period is 62.5μs which corresponds to a 16kHz sample rate. 
The displayed data is numerically small because it corresponds to leading silence. Any part of the 
file could be viewed by suitable choice of the begin and end sample indices. For example, 

    HList -s 5000 -e 5049 -F TIMIT timit.wav 



would display samples 5000 through to 5049. The output might look like the following 

    ---------------------------- Samples: 5000->5049 ---------------------------- 
    5000:   85  -116   ...  -252    23    99    69    92    79  -166 
    5010: -100  -123   ...    48   -19    15   111    41  -126  -304 
    5020: -189    91   ...   255    80   134  -174   -55    57   155 
    5030:   90    -1   ...   154    68   149   -70    91   165   240 
    5040:  297    50   ...    72   187   189   193   244   198   128 
    ----------------------------------- END ------------------------------------- 

The second use of HList is to check that input conversions are being performed properly. 
Suppose that the above TIMIT format file is part of a database to be used for training a recogniser 
and that mel-frequency cepstra are to be used along with energy and the first differential coefficients. 
Suitable configuration parameters needed to achieve this might be as follows 

    # Wave -> MFCC config file 
    SOURCEFORMAT = TIMIT      # same as -F TIMIT 
    TARGETKIND   = MFCC_E_D   # MFCC + Energy + Deltas 
    TARGETRATE   = 100000     # 10ms frame rate 
    WINDOWSIZE   = 200000     # 20ms window 
    NUMCHANS     = 24         # num filterbank channels 
    NUMCEPS      = 8          # compute c1 to c8 

HList can be used to check this. For example, typing 

    HList -C config -o -h -t -s 100 -e 104 -i 9 timit.wav 

will cause the waveform file to be converted, then the source header, the target header and parameter 
vectors 100 through to 104 to be listed. A typical output would be as follows 

    ----------------------------- Source: timit.wav ----------------------------- 
      Sample Bytes:  2          Sample Kind:   WAVEFORM 
      Num Comps:     1          Sample Period: 62.5 us 
      Num Samples:   31437      File Format:   TIMIT 
    ----------------------------------- Target ---------------------------------- 
      Sample Bytes:  72         Sample Kind:   MFCC_E_D 
      Num Comps:     18         Sample Period: 10000.0 us 
      Num Samples:   195        File Format:   HTK 
    --------------------------- Observation Structure --------------------------- 
    x:   MFCC-1  MFCC-2  MFCC-3  MFCC-4  MFCC-5  MFCC-6  MFCC-7  MFCC-8       E 
          Del-1   Del-2   Del-3   Del-4   Del-5   Del-6   Del-7   Del-8    DelE 
    ----------------------------- Samples: 100->104 ----------------------------- 
    100:  3.573 -19.729  -1.256  -6.646  -8.293 -15.601 -23.404  10.988   0.834 
          3.161  -1.913   0.573  -0.069  -4.935   2.309  -5.336   2.460   0.080 
    101:  3.372 -16.278  -4.683  -3.600 -11.030  -8.481 -21.210  10.472   0.777 
          0.608  -1.850  -0.903  -0.665  -2.603  -0.194  -2.331   2.180   0.069 
    102:  2.823 -15.624  -5.367  -4.450 -12.045 -15.939 -22.082  14.794   0.830 
         -0.051   0.633  -0.881  -0.067  -1.281  -0.410   1.312   1.021   0.005 
    103:  3.752 -17.135  -5.656  -6.114 -12.336 -15.115 -17.091  11.640   0.825 
         -0.002  -0.204   0.015  -0.525  -1.237  -1.039   1.515   1.007   0.015 
    104:  3.127 -16.135  -5.176  -5.727 -14.044 -14.333 -18.905  15.506   0.833 
         -0.034  -0.247   0.103  -0.223  -1.575   0.513   1.507   0.754   0.006 
    ----------------------------------- END ------------------------------------- 

The target header information shows that the converted data consists of 195 parameter vectors, 
each vector having 18 components and being 72 bytes in size. The structure of each parameter vector 
is displayed as a simple sequence of floating-point numbers. The layout information described in 
section 5.10 can be used to interpret the data. However, including the -o option, as in the example, 
causes HList to output a schematic of the observation structure. Thus, it can be seen that the first 
row of each sample contains the static coefficients and the second contains the delta coefficients.
The energy is in the final column. The command line option -i 9 controls the number of values
displayed per line and can be used to aid in the visual interpretation of the data. Notice finally
that the command line option -F TIMIT was not required in this case because the source format
was specified in the configuration file.
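
The target header values can be cross-checked against the configuration file. The Python sketch below assumes the usual framing rule (one frame per complete window, advancing by the frame shift) and 4-byte floating-point storage; both are assumptions made for illustration rather than a statement of how HParm is implemented, and the function names are hypothetical.

```python
# Hypothetical helpers; the framing rule and 4-byte values are assumptions.

def num_frames(n_samples: int, src_period_us: float,
               window_100ns: int, shift_100ns: int) -> int:
    """Number of complete analysis windows that fit in the waveform."""
    period_100ns = src_period_us * 10            # 62.5 us -> 625 units of 100ns
    win = int(round(window_100ns / period_100ns))     # window length in samples
    shift = int(round(shift_100ns / period_100ns))    # frame shift in samples
    if n_samples < win:
        return 0
    return (n_samples - win) // shift + 1


def vector_size(num_ceps: int, with_energy: bool, with_deltas: bool,
                bytes_per_value: int = 4) -> tuple:
    """Return (components, bytes) for one parameter vector."""
    n = num_ceps + (1 if with_energy else 0)     # static part
    if with_deltas:
        n *= 2                                   # one delta per static value
    return n, n * bytes_per_value


if __name__ == "__main__":
    # WINDOWSIZE = 200000, TARGETRATE = 100000, NUMCEPS = 8, kind MFCC_E_D
    print(num_frames(31437, 62.5, 200000, 100000))
    print(vector_size(8, with_energy=True, with_deltas=True))
```

Under these assumptions the sketch reproduces the 195 vectors, 18 components and 72 bytes reported in the target header.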

It should be stressed that when HList displays parameterised data, it does so in exactly the
form that observations are passed to a HTK tool. So, for example, if the above data was input to
a system built using 3 data streams, then this can be simulated by using the command line option
-n to set the number of streams. For example, typing

HList -C config -n 3 -o -s 100 -e 101 -i 9 timit.wav

would result in the following output 

Observation Structure
  nTotal=18 nStatic=8 nDel=16 eSep=T
  x.1:   MFCC-1  MFCC-2  MFCC-3  MFCC-4  MFCC-5  MFCC-6  MFCC-7  MFCC-8
  x.2:    Del-1   Del-2   Del-3   Del-4   Del-5   Del-6   Del-7   Del-8
  x.3:        E    DelE
Samples: 100->101
100.1:    3.573 -19.729  -1.256  -6.646  -8.293 -15.601 -23.404  10.988
100.2:    3.161  -1.913   0.573  -0.069  -4.935   2.309  -5.336   2.460
100.3:    0.834   0.080
101.1:    3.372 -16.278  -4.683  -3.600 -11.030  -8.481 -21.210  10.472
101.2:    0.608  -1.850  -0.903  -0.665  -2.603  -0.194  -2.331   2.180
101.3:    0.777   0.069
END

Notice that the data is identical to the previous case, but it has been re-organised into separate
streams.
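
The re-organisation into streams can be mimicked on the single-stream vector listed earlier. The following Python sketch assumes the layout shown in the two listings (statics, energy, deltas, delta energy in the single-stream case; statics, deltas, energies in the 3-stream case). The function name is hypothetical and the code is not taken from HTK.

```python
# Hypothetical illustration of the 3-stream split shown in the HList output.

def split_streams(vec, n_ceps):
    """Split [statics, E, deltas, DelE] into (statics, deltas, [E, DelE])."""
    if len(vec) != 2 * (n_ceps + 1):
        raise ValueError("unexpected vector length")
    static = vec[:n_ceps]
    energy = vec[n_ceps]
    deltas = vec[n_ceps + 1:2 * n_ceps + 1]
    del_energy = vec[2 * n_ceps + 1]
    return static, deltas, [energy, del_energy]


if __name__ == "__main__":
    # Sample 100 from the single-stream listing.
    sample_100 = [3.573, -19.729, -1.256, -6.646, -8.293, -15.601, -23.404,
                  10.988, 0.834,
                  3.161, -1.913, 0.573, -0.069, -4.935, 2.309, -5.336,
                  2.460, 0.080]
    for stream in split_streams(sample_100, n_ceps=8):
        print(stream)
```

Applied to sample 100, the three lists correspond to the rows 100.1, 100.2 and 100.3 of the multi-stream listing.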

5.16 Copying and Coding using HCopy 

HCopy is a general-purpose tool for copying and manipulating speech files. The general form of
invocation is



HCopy src tgt 



which will make a new copy called tgt of the file called src. HCopy can also concatenate several
sources together as in



HCopy src1 + src2 + src3 tgt

which concatenates the contents of src1, src2 and src3, storing the results in the file tgt. As well
as putting speech files together, HCopy can also take them apart. For example,

HCopy -s 100 -e -100 src tgt 

will extract samples 100 through to N-100 of the file src to the file tgt where N is the total number
of samples in the source file. The range of samples to be copied can also be specified with reference 
to a label file, and modifications made to the speech file can be tracked in a copy of the label file. 
All of the various options provided by HCopy are given in the reference section and in total they
provide a powerful facility for manipulating speech data files.

However, the use of HCopy extends beyond that of copying, chopping and concatenating files.
HCopy reads in all files using the speech input/output subsystem described in the preceding
sections. Hence, by specifying an appropriate configuration file, HCopy is also a speech coding
tool. For example, if the configuration file config was set up to convert waveform data to MFCC
coefficients, the command

HCopy -C config -s 100 -e -100 src.wav tgt.mfc 

would parameterise the waveform file src.wav, excluding the first and last 100 samples, and
store the result in tgt.mfc.

HCopy will process its arguments in pairs, and as with all HTK tools, argument lists can be
written in a script file specified via the -S option. When coding a large database, the separate
invocation of HCopy for each file needing to be processed would incur a very large overhead.
Hence, it is better to create a file, flist say, containing a list of all source and target files, as in
for example,

src1.wav tgt1.mfc
src2.wav tgt2.mfc
src3.wav tgt3.mfc
src4.wav tgt4.mfc
etc

and then invoke HCopy by 

HCopy -C config -s 100 -e -100 -S flist

which would encode each file listed in flist in a single invocation.
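
A script file of this kind is easily generated automatically. The Python sketch below builds the source/target pairs for a list of waveform files; the directory names, the .mfc extension and the function name are assumptions chosen for illustration.

```python
# Hypothetical helper for building an HCopy script file (one "src tgt" pair per line).
from pathlib import Path


def script_lines(sources, target_dir, target_ext=".mfc"):
    """Return one 'source target' line for each source file."""
    lines = []
    for src in sources:
        tgt = Path(target_dir) / (Path(src).stem + target_ext)
        lines.append(f"{src} {tgt.as_posix()}")
    return lines


if __name__ == "__main__":
    pairs = script_lines(["wav/src1.wav", "wav/src2.wav"], "mfc")
    # Write the list to a file named flist, as in the text.
    Path("flist").write_text("\n".join(pairs) + "\n")
    print("\n".join(pairs))
```

The resulting file can then be passed to HCopy with the -S option as shown above.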

Normally HCopy makes a direct copy of the target speech data in the output file. However,
if the configuration parameter SAVECOMPRESSED is set true then the output is saved in compressed
form and if the configuration parameter SAVEWITHCRC is set true then a checksum is appended to
the output (see section 5.10). If the configuration parameter SAVEASVQ is set true then only VQ
indices are saved and the kind of the target file is changed to DISCRETE. For this to work, the target
kind must have the qualifier _V attached (see section 5.14).

[Figure: conversion graph over the parameter kinds WAVEFORM, LPC, LPREFC, LPCEPSTRA, MELSPEC, FBANK and MFCC]

Fig. 5.9 Valid Parameter Kind Conversions

5.17 Version 1.5 Compatibility 

The redesign of the HTK front-end in version 2 has introduced a number of differences in parameter
encoding. The main changes are

1. Source waveform zero mean processing is now performed on a frame-by-frame basis.

2. Delta coefficients use a modified form of regression rather than simple differences at the start
and end of the utterance.

3. Energy scaling is no longer applied to the zero'th MFCC coefficient.

If a parameter encoding is required which is as close as possible to the version 1.5 encoding, then
the compatibility configuration variable V1COMPAT should be set to true.

Note also in this context that the default values for the various configuration values have been
chosen to be consistent with the defaults or recommended practice for version 1.5.

5.18 Summary



This section summarises the various file formats, parameter kinds, qualifiers and configuration 
parameters used by HTK. Table 5.1 lists the audio speech file formats which can be read by the 
HWave module. Table 5.2 lists the basic parameter kinds supported by the HParm module and 
Fig. 5.9 shows the various automatic conversions that can be performed by appropriate choice of 
source and target parameter kinds. Table 5.3 lists the available qualifiers for parameter kinds. 
The first 6 of these are used to describe the target kind. The source kind may already have 
some of these, HParm adds the rest as needed. Note that HParm can also delete qualifiers when 
converting from source to target. The final two qualifiers in Table 5.3 are only used in external
files to indicate compression and an attached checksum. HParm adds these qualifiers to the target
form during output and only in response to setting the configuration parameters SAVECOMPRESSED
and SAVEWITHCRC. Adding the _C or _K qualifiers to the target kind simply causes an error. Finally,
Tables 5.4 and 5.5 list all of the configuration parameters along with their meaning and default
values.



Name      Description

HTK       The standard HTK file format
TIMIT     As used in the original prototype TIMIT CD-ROM
NIST      The standard SPHERE format used by the US NIST
SCRIBE    Subset of the European SAM standard used in the SCRIBE CD-ROM
SDES1     The Sound Designer 1 format defined by Digidesign Inc.
AIFF      Audio interchange file format
SUNAU8    Subset of 8bit ".au" and ".snd" formats used by Sun and NeXT
OGI       Format used by Oregon Graduate Institute similar to TIMIT
WAV       Microsoft WAVE files used on PCs
ESIG      Entropic Esignal file format
AUDIO     Pseudo format to indicate direct audio input
ALIEN     Pseudo format to indicate unsupported file, the alien header size
          must be set via the environment variable HDSIZE
NOHEAD    As for the ALIEN format but header size is zero

Table. 5.1 Supported File Formats


Kind        Meaning

WAVEFORM    scalar samples (usually raw speech data)
LPC         linear prediction coefficients
LPREFC      linear prediction reflection coefficients
LPCEPSTRA   LP derived cepstral coefficients
LPDELCEP    LP cepstra + delta coef (obsolete)
IREFC       LPREFC stored as 16bit (short) integers
MFCC        mel-frequency cepstral coefficients
FBANK       log filter-bank parameters
MELSPEC     linear filter-bank parameters
USER        user defined parameters
DISCRETE    vector quantised codebook symbols
PLP         perceptual linear prediction coefficients
ANON        matches actual parameter kind

Table. 5.2 Supported Parameter Kinds


Qualifier   Meaning

_A          Acceleration coefficients appended
_C          External form is compressed
_D          Delta coefficients appended
_E          Log energy appended
_K          External form has checksum appended
_N          Absolute log energy suppressed
_T          Third differential coefficients appended
_V          VQ index appended
_Z          Cepstral mean subtracted
_0          Cepstral C0 coefficient appended

Table. 5.3 Parameter Kind Qualifiers




Module  Name               Default    Description

HAudio  LINEIN                        Select line input for audio
HAudio  MICIN                         Select microphone input for audio
HAudio  LINEOUT                       Select line output for audio
HAudio  SPEAKEROUT                    Select speaker output for audio
HAudio  PHONESOUT                     Select headphones output for audio
        SOURCEKIND                    Parameter kind of source
        SOURCEFORMAT                  File format of source
        SOURCERATE                    Sample period of source in 100ns units
HWave   NSAMPLES                      Num samples in alien file input via a pipe
HWave   HEADERSIZE                    Size of header in an alien file
HWave   STEREOMODE                    Select channel: RIGHT or LEFT
HWave   BYTEORDER                     Define byte order VAX or other
        NATURALREADORDER   F          Enable natural read order for HTK files
        NATURALWRITEORDER  F          Enable natural write order for HTK files
        TARGETKIND         ANON       Parameter kind of target
        TARGETFORMAT       HTK        File format of target
        TARGETRATE         0.0        Sample period of target in 100ns units
HParm   SAVECOMPRESSED     F          Save the output file in compressed form
HParm   SAVEWITHCRC        T          Attach a checksum to output parameter file
HParm   ADDDITHER          0.0        Level of noise added to input signal
HParm   ZMEANSOURCE        F          Zero mean source waveform before analysis
HParm   WINDOWSIZE         256000.0   Analysis window size in 100ns units
HParm   USEHAMMING         T          Use a Hamming window
HParm   PREEMCOEF          0.97       Set pre-emphasis coefficient
HParm   LPCORDER           12         Order of LPC analysis
HParm   NUMCHANS           20         Number of filterbank channels
HParm   LOFREQ             -1.0       Low frequency cut-off in fbank analysis
HParm   HIFREQ             -1.0       High frequency cut-off in fbank analysis
HParm   USEPOWER           F          Use power not magnitude in fbank analysis
HParm   NUMCEPS            12         Number of cepstral parameters
HParm   CEPLIFTER          22         Cepstral liftering coefficient
HParm   ENORMALISE         T          Normalise log energy
HParm   ESCALE             0.1        Scale log energy
HParm   SILFLOOR           50.0       Energy silence floor (dB)
HParm   DELTAWINDOW        2          Delta window size
HParm   ACCWINDOW          2          Acceleration window size
HParm   VQTABLE            NULL       Name of VQ table
HParm   SAVEASVQ           F          Save only the VQ indices
HParm   AUDIOSIG           0          Audio signal number for remote control



Table. 5.4 Configuration Parameters


Module  Name           Default  Description

HParm   USESILDET      F        Enable speech/silence detector
HParm   MEASURESIL     T        Measure background noise level prior to sampling
HParm   OUTSILWARN     T        Print a warning message to stdout before
                                measuring audio levels
HParm   SPEECHTHRESH   9.0      Threshold for speech above silence level (dB)
HParm   SILENERGY      0.0      Average background noise level (dB)
HParm   SPCSEQCOUNT    10       Window over which speech/silence decision reached
HParm   SPCGLCHCOUNT   0        Maximum number of frames marked as silence in
                                window which is classified as
HParm   SILSEQCOUNT    100
HParm   SILGLCHCOUNT
HParm   SILMARGIN
HParm   V1COMPAT
        TRACE          0


Many of the operations performed by HTK which involve speech data files assume that the
speech is divided into segments and each segment has a name or label. The set of labels associated
with a speech file constitute a transcription and each transcription is stored in a separate label file.
Typically, the name of the label file will be the same as the corresponding speech file but with a
different extension. For convenience, label files are often stored in a separate directory and all HTK
tools have an option to specify this. When very large numbers of files are being processed, label
file access can be greatly facilitated by using Master Label Files (MLFs). MLFs may be regarded as
index files holding pointers to the actual label files which can either be embedded in the same index
file or stored anywhere else in the file system. Thus, MLFs allow large sets of files to be stored in a
single file, they allow a single transcription to be shared by many logical label files and they allow
arbitrary file redirection.

The HTK interface to label files is provided by the module HLabel which implements the MLF
facility and support for a number of external label file formats. All of the facilities supplied by
HLabel, including the supported label file formats, are described in this chapter. In addition,
HTK provides a tool called HLEd for simple batch editing of label files and this is also described.
Before proceeding to the details, however, the general structure of label files will be reviewed.



6.1 Label File Structure 

Most transcriptions are single-alternative and single-level, that is to say, the associated speech file 
is described by a single sequence of labelled segments. Most standard label formats are of this 
kind. Sometimes, however, it is useful to have several levels of labels associated with the same basic 
segment sequence. For example, in training a HMM system it is useful to have both the word level
transcriptions and the phone level transcriptions side-by-side.

Orthogonal to the requirement for multiple levels of description, a transcription may also need
to include multiple alternative descriptions of the same speech file. For example, the output of
a speech recogniser may be in the form of an N-best list where each word sequence in the list
represents one possible interpretation of the input.

As an example, Fig. 6.1 shows a speech file and three different ways in which it might be
labelled. In part (a), just a simple orthography is given and this single-level single-alternative type
of transcription is the commonest case. Part (b) shows a 2-level transcription where the basic
level consists of a sequence of phones but a higher level of word labels is also provided. Notice
that there is a distinction between the basic level and the higher levels, since only the basic level
has explicit boundary locations marked for every segment. The higher levels do not have explicit
boundary information since this can always be inferred from the basic level boundaries. Finally,
part (c) shows the case where knowledge of the contents of the speech file is uncertain and three
possible word sequences are given.

HTK label files support multiple-alternative and multiple-level transcriptions. In addition to
start and end times on the basic level, a label at any level may also have a score associated with
it. When a transcription is loaded, all but one specific alternative can be discarded by setting
the configuration variable TRANSALT to the required alternative N, where the first (i.e. normal)
alternative is numbered 1. Similarly, all but a specified level can be discarded by setting the
configuration variable TRANSLEV to the required level number where again the first (i.e. normal)
level is numbered 1.

All non-HTK formats are limited to single-level single-alternative transcriptions.



6.2 Label File Formats

As with speech data files, HTK not only defines its own format for label files but also supports
a number of external formats. Defining an external format is similar to the case for speech data
files except that the relevant configuration variables for specifying a format other than HTK are
called SOURCELABEL and TARGETLABEL. The source label format can also be specified using the -G
command line option. As with using the -F command line option for speech data files, the -G option
overrides any setting of SOURCELABEL.

6.2.1 HTK Label Files 

The HTK label format is text based. As noted above, a single label file can contain multiple-
alternatives and multiple-levels.

Each line of a HTK label file contains the actual label optionally preceded by start and end
times, and optionally followed by a match score.

[start [end] ] name [score] { auxname [auxscore] } [comment]




where start denotes the start time of the labelled segment in 100ns units, end denotes the end time
in 100ns units, name is the name of the segment and score is a floating point confidence score. All
fields except the name are optional. If end is omitted then it is set equal to -1 and ignored. This
case would occur with data which had been labelled frame synchronously. If start and end are
both missing then both are set to -1 and the label file is treated as a simple symbolic transcription.
The optional score would typically be a log probability generated by a recognition tool. When
omitted the score is set to 0.0.

The following example corresponds to the transcription shown in part (a) of Fig. 6.1



0000000 3600000 ice 
3600000 8200000 cream 
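
The defaulting rules just described can be made concrete with a small parser. The Python sketch below handles only the single-level case (start, end, name and score); auxiliary names and comments are ignored, and a label whose name is itself a number would be misread as a time. It is an illustration of the rules in the text, not the HLabel implementation, and the function names are hypothetical.

```python
# Hypothetical single-level parser for lines of the form: [start [end]] name [score]

def _is_int(token: str) -> bool:
    try:
        int(token)
        return True
    except ValueError:
        return False


def parse_label_line(line: str):
    """Return (start, end, name, score) using the defaults given in the text."""
    tokens = line.split()
    start = end = -1                      # missing times are set to -1
    if tokens and _is_int(tokens[0]):
        start = int(tokens.pop(0))
        if tokens and _is_int(tokens[0]):
            end = int(tokens.pop(0))
    if not tokens:
        raise ValueError("label line has no name")
    name = tokens.pop(0)
    score = 0.0                           # a missing score is set to 0.0
    if tokens:
        try:
            score = float(tokens[0])
        except ValueError:
            pass                          # next field is not a score; ignored here
    return start, end, name, score


def to_seconds(t_100ns: int) -> float:
    """Convert a time in 100ns units to seconds."""
    return t_100ns * 1.0e-7


if __name__ == "__main__":
    print(parse_label_line("0000000 3600000 ice"))
    print(parse_label_line("cream"))
```

For the example above, the first segment runs from 0 to 3600000 units of 100ns, that is, from 0 to 0.36 seconds.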



Multiple levels are described by adding further names alongside the basic name. The lowest level 
(shortest segments) should be given first since only the lowest level has start and end times. The 
label file corresponding to the transcription illustrated in part (b) of Fig. 6.1 would be as follows. 

0000000 2200000 ay ice
2200000 3600000 s
3600000 4300000 k cream
4300000 5000000 r
5000000 7400000 iy
7400000 8200000 m

Finally, multiple alternatives are written as a sequence of separate label lists separated by three
slashes (///). The label file corresponding to the transcription illustrated in part (c) of Fig. 6.1
would therefore be as follows.

0000000 2200000 I
2200000 8200000 scream
///
0000000 3600000 ice
3600000 8200000 cream
///
0000000 3600000 eyes
3600000 8200000 cream
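
Selecting one alternative from such a file, in the manner of the TRANSALT setting described earlier, amounts to splitting the text at the /// separators. The following Python sketch shows the idea; the function names are hypothetical and the code is not part of HTK.

```python
# Hypothetical helpers for splitting a multiple-alternative label file.

def split_alternatives(text: str):
    """Return a list of alternatives, each a list of label lines."""
    alternatives, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue                      # skip blank lines
        if stripped == "///":
            alternatives.append(current)  # separator closes one alternative
            current = []
        else:
            current.append(stripped)
    alternatives.append(current)
    return alternatives


def select_alternative(text: str, n: int):
    """Return alternative n, where the first alternative is numbered 1."""
    return split_alternatives(text)[n - 1]


if __name__ == "__main__":
    labels = """0000000 2200000 I
2200000 8200000 scream
///
0000000 3600000 ice
3600000 8200000 cream
///
0000000 3600000 eyes
3600000 8200000 cream
"""
    print(select_alternative(labels, 2))
```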



Actual label names can be any sequence of characters. However, the - and + characters are
reserved for identifying the left and right context, respectively, in a context-dependent phone label.
For example, the label N-aa+V might be used to denote the phone aa when preceded by a nasal and
followed by a vowel. These context-dependency conventions are used in the label editor HLEd, and
are understood by all HTK tools.

6.2.2 ESPS Label Files

An ESPS/waves+ label file is a text file with one label stored per line. Each label indicates a segment
boundary. A complete description of the ESPS/waves+ label format is given in the ESPS/waves+
manual pages xwaves (1-ESPS) and xlabel (1-ESPS). Only details required for use with HTK
are given here.

The label data follows a header which ends with a line containing only a #. The header contents
are generally ignored by HLabel. The labels follow the header in the form

time ccode name 

where time is a floating point number which denotes the boundary location in seconds, ccode is
an integer color map entry used by ESPS/waves+ in drawing segment boundaries and name is the
name of the segment boundary. A typical value for ccode is 121.

While each HTK label can contain both a start and an end time which indicate the boundaries
of a labeled segment, ESPS/waves+ labels contain a single time in seconds which (by convention)
refers to the end of the labeled segment. The starting time of the segment is taken to be the end
of the previous segment and 0 initially.
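
The end-time convention can be illustrated by converting a list of ESPS/waves+ boundaries into HTK style start and end times. The Python sketch below assumes the (time, name) pairs have already been read from the file, with the header and the ccode field discarded; the function name is hypothetical.

```python
# Hypothetical conversion from ESPS-style end times (seconds) to HTK-style
# start/end times (100ns units).

def esps_to_htk(boundaries):
    """boundaries: iterable of (end_time_seconds, name) in file order."""
    segments = []
    previous_end = 0                      # the first segment starts at 0
    for end_seconds, name in boundaries:
        end = round(end_seconds * 1.0e7)  # seconds -> 100ns units
        segments.append((previous_end, end, name))
        previous_end = end                # next segment starts where this ends
    return segments


if __name__ == "__main__":
    for segment in esps_to_htk([(0.36, "ice"), (0.82, "cream")]):
        print(*segment)
```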

ESPS/waves+ label files may have several boundary names per line. However, HLabel only
reads ESPS/waves+ label files with a single name per boundary. Multiple-alternative and/or
multiple-level HTK label data structures cannot be saved using ESPS/waves+ format label files.

6.2.3 TIMIT Label Files

TIMIT label files are identical to single-alternative single-level HTK label files without scores except
that the start and end times are given as sample numbers rather than absolute times. TIMIT label
files are used on both the prototype and final versions of the TIMIT CD-ROM.
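
Converting a TIMIT sample number to an HTK time only requires the sample period. The Python sketch below assumes the 62.5 µs period of the timit.wav example used earlier in this book; the function name is hypothetical.

```python
# Hypothetical conversion of a TIMIT sample index to HTK 100ns time units.

def samples_to_100ns(sample_index: int, sample_period_us: float = 62.5) -> int:
    """Convert a sample number to a time in 100ns units."""
    return round(sample_index * sample_period_us * 10)


if __name__ == "__main__":
    # One second of speech at 16kHz is 16000 samples.
    print(samples_to_100ns(16000))
```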

6.2.4 SCRIBE Label Files

The SCRIBE label file format is a subset of the European SAM label file format. SAM label files
are text files and each line begins with a label identifying the type of information stored on that
line. The HTK SCRIBE format recognises just three label types

LBA - acoustic label
LBB - broad class label
UTS - utterance

For each of these, the rest of the line is divided into comma separated fields. The LBA and LBB 
types have 4 fields: start sample, centre sample, end sample and label. HTK expects the centre 
sample to be blank. The UTS type has 3 fields: start sample, end sample and label. UTS labels 
may be multi-word since they can refer to a complete utterance. In order to make such labels usable 
within HTK tools, between word blanks are converted to underscore characters. The EX command 
in the HTK label editor HLEd can then be used to split such a compound label into individual
word labels if required.

6.3 Master Label Files

6.3.1 General Principles of MLFs

Logically, the organisation of data and label files is very simple. Every data file has a label file of 
the same name (but different extension) which is either stored in the same directory as the data
file or in some other specified directory.

[Figure: panel (c) 3-alternative, 1-level]

Fig. 6.1 Example Transcriptions

This scheme is sufficient for most needs and commendably simple. However, there are many
cases where either it makes unnecessarily inefficient use of the operating system or it seriously
inconveniences the user. For example, to use a training tool with isolated word data may require
the generation of hundreds or thousands of label files each having just one label entry. Even where
individual label files are appropriate (as in the phonetically transcribed TIMIT database), each
label file must be stored in the same directory as the data file it transcribes, or all label files must
be stored in the same directory. One cannot, for example, have a different directory of label files for
each TIMIT dialect region and then run the HTK training tool HERest on the whole database.

All of these problems can be solved by the use of Master Label Files (MLFs). Every HTK tool
which uses label files has a -I option which can be used to specify the name of an MLF file. When
an MLF has been loaded, the normal rules for locating a label file apply except that the MLF 
is searched first. If the required label file f is found via the MLF then that is loaded, otherwise 
the file f is opened as normal. If f does not exist, then an error is reported. The -I option may 
be repeated on the command line to open several MLF files simultaneously. In this case, each is 
searched in turn before trying to open the required file. 

MLFs can do two things. Firstly, they can contain embedded label definitions so that many or 
all of the needed label definitions can be stored in the same file. Secondly, they can contain the 
names of sub-directories to search for label files. In effect, they allow multiple search paths to be 
defined. Both of these two types of definition can be mixed in a single MLF.

MLFs are quite complex to understand and use. However, they add considerable power and
flexibility to HTK which combined with the -S and -L options mean that virtually any organisation
of data and label files can be accommodated.

6.3.2 Syntax and Semantics 

An MLF consists of one or more individual definitions. Blank lines in an MLF are ignored but 
otherwise the line structure is significant. The first line must contain just #!MLF!# to identify it 
as an MLF file. This is not necessary for use with the -I option but some HTK tools need to be 
able to distinguish an MLF from a normal label file. The following syntax of MLF files is described
using an extended BNF notation in which alternatives are separated by a vertical bar |, parentheses
( ) denote factoring, brackets [ ] denote options, and braces { } denote zero or more repetitions.

MLF = "#!MLF!#"
      MLFDef { MLFDef }

Each definition is either a transcription for immediate loading or a subdirectory to search.

MLFDef = ImmediateTranscription | SubDirDef

An immediate transcription consists of a pattern on a line by itself immediately followed by a
transcription which as far as the MLF is concerned is arbitrary text. It is read using whatever label
file "driver" routines are installed in HLabel. It is terminated by a period written on a line of its
own.

ImmediateTranscription =
      Pattern
      Transcription
      "."



A subdirectory definition simply gives the name of a subdirectory to search. If the required
label file is found in that subdirectory then the label file is loaded, otherwise the next matching
subdirectory definition is checked.

SubDirDef = Pattern SearchMode String

SearchMode = "->" | "=>"

The two types of search mode are described below. A pattern is just a string

Pattern = String

except that the characters '?' and '*' embedded in the string act as wildcards such that '?' matches
any single character and '*' matches 0 or more characters. A string is any sequence of characters
enclosed in double quotes.
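
The wildcard rules can be expressed compactly by translating a pattern into a regular expression. The Python sketch below does this for the two wildcards described above and treats every other character literally; it is an illustration of the matching rule only, not the algorithm used by HLabel, and the function names are hypothetical.

```python
# Hypothetical matcher for MLF patterns: '?' = any single character,
# '*' = zero or more characters, everything else is literal.
import re


def mlf_pattern_to_regex(pattern: str):
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))   # treat all other characters literally
    return re.compile("".join(parts), re.DOTALL)


def mlf_match(pattern: str, name: str) -> bool:
    """True if the whole of name matches the pattern."""
    return mlf_pattern_to_regex(pattern).fullmatch(name) is not None


if __name__ == "__main__":
    print(mlf_match("*/a.lab", "data/train/a.lab"))
    print(mlf_match("*/u?.lab", "labs/u33.lab"))
```

A hand-written translation is used here rather than Python's fnmatch module because fnmatch also gives special meaning to square brackets, which the description above does not.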


6.3.3 MLF Search 

The names of label files in HTK are invariably reconstructed from an existing data file name and
this means that the file names used to access label files can be partial or full path names in which
the path has been constructed either from the path of the corresponding data file or by direct
specification via the -L option. These path names are retained in the MLF search which proceeds
as follows. The given label file specification ../d3/d2/d1/name is matched against each pattern in
the MLF. If a pattern matches, then either the named subdirectory is searched or an immediate
definition is loaded. Pattern matching continues in this way until a definition is found. If no
pattern matches then an attempt is made to open ../d3/d2/d1/name directly. If this fails an error
is reported.

The search of a sub-directory proceeds as follows. In simple search mode indicated by ->, the file
name must occur directly in the sub-directory. In full search mode indicated by =>, the files name,
d1/name, d2/d1/name, etc. are searched for in that order. This full search allows a hierarchy of
label files to be constructed which mirrors a hierarchy of data files (see Example 4 below).
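
The order in which candidate files are tried in the two search modes can be sketched as follows. The Python code below only builds the list of candidate paths; skipping "." and ".." components of the label file specification is an assumption made for this illustration, and the function names are hypothetical.

```python
# Hypothetical sketch of the candidate paths tried in an MLF sub-directory search.

def full_search_candidates(subdir: str, label_spec: str):
    """Candidates for '=>' mode: name, d1/name, d2/d1/name, and so on."""
    parts = [p for p in label_spec.split("/") if p not in ("", ".", "..")]
    root = subdir.rstrip("/")
    return [root + "/" + "/".join(parts[-k:]) for k in range(1, len(parts) + 1)]


def simple_search_candidates(subdir: str, label_spec: str):
    """Candidate for '->' mode: the file name directly in the sub-directory."""
    return full_search_candidates(subdir, label_spec)[:1]


if __name__ == "__main__":
    for path in full_search_candidates("/disk3", "../d3/d2/d1/name"):
        print(path)
```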

Hashing is performed when the label file specification is either a full path name or in the form
*/file so in these cases the search is very fast. Any other use of metacharacters invokes a linear
search with a full and relatively slow pattern match at each step. Note that all tools which generate
label files have a -l option which is used to define the output directory in which to store individual
label files. When outputting master label files, the -l option can be used to define the path in the
output label file specifications. In particular, setting the option -l '*' causes the form */file to
be generated.



6.3.4 MLF Examples 

1. Suppose a data set consisted of two training data files with corresponding label files:

a.lab contains

000000 590000 sil
600000 2090000 a
2100000 4500000 sil

b.lab contains

000000 990000 sil
1000000 3090000 b
3100000 4200000 sil

Then the above two individual label files could be replaced by a single MLF

#!MLF!#
"*/a.lab"
000000 590000 sil
600000 2090000 a
2100000 4500000 sil
.
"*/b.lab"
000000 990000 sil
1000000 3090000 b
3100000 4200000 sil
.

2. A digit data base contains training tokens one.1.wav, one.2.wav, one.3.wav, ..., two.1.wav,
two.2.wav, two.3.wav, ..., etc. Label files are required containing just the name of the
model so that HTK tools such as HERest can be used. If MLFs are not used, individual label
files are needed. For example, the individual label files one.1.lab, one.2.lab, one.3.lab,
... would be needed to identify instances of "one" even though each file contains the same
entry, just

one

Using an MLF containing

#!MLF!#
"*/one.*.lab"
one
.
"*/two.*.lab"
two
.
"*/three.*.lab"
three
.
etc.

avoids the need for many duplicate label files.

3. A training database /db contains directories dr1, dr2, ..., dr8. Each directory contains
a subdirectory called labs holding the label files for the data files in that directory. The
following MLF would allow them to be found

#!MLF!#
"*" -> "/db/dr1/labs"
"*" -> "/db/dr2/labs"
...
"*" -> "/db/dr7/labs"
"*" -> "/db/dr8/labs"

Each attempt to open a label file will result in a linear search through dr1 to dr8 to find that
file. If the sub-directory name is embedded into the label file name, then this searching can
be avoided. For example, if the label files in directory drx had the form drx_xxxx.lab, then
the MLF would be written as

#!MLF!#
"*/dr1_*" -> "/db/dr1/labs"
"*/dr2_*" -> "/db/dr2/labs"
...
"*/dr7_*" -> "/db/dr7/labs"
"*/dr8_*" -> "/db/dr8/labs"

4. A training database is organised as a hierarchy where /disk1/db/dr1/sp2/u3.wav is the data
file for the third repetition from speaker sp2 in dialect region dr1 (see Figure 6.2).

[Figure: directory trees for the data files (u1.wav, u2.wav, ...) and the label files (u1.lab, u2.lab, ...)]

Fig. 6.2 Database Hierarchy: Data [Left]; Labels [Right].

Suppose that a similar hierarchy of label files was constructed on disk3. These label files
could be found by any HTK tool by using an MLF containing just

#!MLF!#
"*" => "/disk3"

If for some reason all of the drN directories were renamed IdrN in the label hierarchy, then 
this could be handled by an MLF file containing 



# ! MLF ! # 

"*/drl/*" => '7disk3/ldrl 
"*/dr2/*" => '7disk3/ldr2 
"*/dr3/*" => '7disk3/ldr3 
etc . 
These few examples should illustrate the flexibility and power of MLF files. It should be noted,
however, that when generating label names automatically from data file names, HTK sometimes
discards path details. For example, during recognition, if the data files /disk1/dr2/sx43.wav and
/disk2/dr4/sx43.wav are being recognised, and a single directory is specified for the output label
files, then recognition results for both files will be written to a file called sx43.lab, and the latter
occurrence will overwrite the former.
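The overwriting problem can be checked for before a recognition run. The sketch below assumes, as described above, that the output name is formed from the output directory plus the data file's base name with a .lab extension; the directory name `results` and the helper names are hypothetical.

```python
import posixpath
from collections import defaultdict


def output_label_name(data_file, out_dir):
    """Name of the label file when only the base name of data_file is kept."""
    stem = posixpath.splitext(posixpath.basename(data_file))[0]
    return posixpath.join(out_dir, stem + ".lab")


def find_collisions(data_files, out_dir):
    """Map each output name to the data files that would write to it."""
    targets = defaultdict(list)
    for path in data_files:
        targets[output_label_name(path, out_dir)].append(path)
    return {name: paths for name, paths in targets.items() if len(paths) > 1}


files = ["/disk1/dr2/sx43.wav", "/disk2/dr4/sx43.wav", "/disk1/dr2/sx44.wav"]
print(find_collisions(files, "results"))
```

Any entry reported by `find_collisions` marks a set of data files whose results would end up in the same label file.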



6.4 Editing Label Files 

HTK training tools typically expect the labels used in transcription files to correspond directly to
the names of the HMMs chosen to build an application. Hence, the label files supplied with a speech
database will often need modifying. For example, the original transcriptions attached to a database
might be at a fine level of acoustic detail. Groups of labels corresponding to a sequence of acoustic
events (e.g. pcl p') might need converting to some simpler form (e.g. p) which is more suitable for
being represented by a HMM. As a second example, current high performance speech recognisers
use a large number of context dependent models to allow more accurate acoustic modelling. For
this case, the labels in the transcription must be converted to show the required contexts explicitly.

HTK supplies a tool called HLEd for rapidly and efficiently converting label files. The HLEd
command invocation specifies the names of the files to be converted and the name of a script file
holding the actual HLEd commands. For example, the command

HLEd edfile.led l1 l2 l3

would apply the edit commands stored in the file edfile.led to each of the label files l1, l2 and l3.
More commonly the new label files are stored in a new directory to avoid overwriting the originals.
This is done by using the -l option. For example,

HLEd -l newlabs edfile.led l1 l2 l3

would have the same effect as previously except that the new label files would be stored in the
directory newlabs.

Each edit command stored in an edit file is identified by a mnemonic consisting of two letters¹
and must be stored on a separate line. The supplied edit commands can be divided into two
groups. The first group consists of commands which perform selective changes to specific labels and
the second group contains commands which perform global transformations. The reference section
defines all of these commands. Here a few examples will be given to illustrate the use of HLEd.

As a first example, when using the TIMIT database, the original 61 phoneme symbol set is often
mapped into a simpler 48 phoneme symbol set. The aim of this mapping is to delete all glottal
stops, replace all closures preceding a voiced stop by a generic voiced closure (vcl), all closures
preceding an unvoiced stop by a generic unvoiced closure (cl) and the different types of silence by
a single generic silence (sil). A HLEd script to do this might be

# Map 61 Phone Timit Set -> 48 Phones
SO
DE q
RE cl pcl tcl kcl qcl
RE vcl bcl dcl gcl
RE sil h# #h pau

The first line is a comment indicated by the initial hash character. The command on the second
line is the Sort command SO. This is an example of a global command. Its effect is to sort all the
labels into time order. Normally the labels in a transcription will already be in time order but
some speech editors simply output labels in the order that the transcriber marked them. Since this
would confuse the re-estimation tools, it is good practice to explicitly sort all label files in this way.

The command on the third line is the Delete command DE. This is a selective command. Its
effect is to delete all of the labels listed on the rest of the command line, wherever they occur. In
this case, there is just one label listed for deletion, the glottal stop q. Hence, the overall effect of
this command will be to delete all occurrences of the q label in the edited label files.

The remaining commands in this example script are Replace commands RE. The effect of a
Replace command is to substitute the first label following the RE for every occurrence of the remaining
labels on that line. Thus, for example, the command on the fourth line causes all occurrences of the
labels pcl, tcl, kcl or qcl to be replaced by the label cl.

¹ Some command names have single letter alternatives for compatibility with earlier versions of HTK.
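The effect of the Sort, Delete and Replace commands described above can be mimicked on an in-memory label list. This is a rough sketch of the semantics only: labels are assumed to be `(start, end, name)` tuples, all function names are hypothetical, and no claim is made about how HLEd itself treats the time span of a deleted label.

```python
def sort_labels(labels):
    """SO: order the labels by start time."""
    return sorted(labels, key=lambda seg: seg[0])


def delete_labels(labels, names):
    """DE: drop every label whose name is listed."""
    return [seg for seg in labels if seg[2] not in names]


def replace_labels(labels, new, old):
    """RE: substitute `new` for every occurrence of a name in `old`."""
    return [(s, e, new if name in old else name) for s, e, name in labels]


def map_61_to_48(labels):
    """Apply the commands in script order, each to the previous result."""
    labels = sort_labels(labels)
    labels = delete_labels(labels, {"q"})
    labels = replace_labels(labels, "cl", {"pcl", "tcl", "kcl", "qcl"})
    labels = replace_labels(labels, "vcl", {"bcl", "dcl", "gcl"})
    labels = replace_labels(labels, "sil", {"h#", "#h", "pau"})
    return labels


unsorted = [(10, 20, "tcl"), (0, 10, "h#"), (20, 30, "q"), (30, 40, "t")]
print(map_61_to_48(unsorted))
```

Each step receives the output of the previous one, which mirrors the sequential application of edit commands.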

To illustrate the overall effect of the above HLEd command script on a complete label file, the
following TIMIT format label file

0000 2241 h#
2241 2715 w
2715 4360 ow
4360 5478 bcl
5478 5643 b
5643 6360 iy
6360 7269 tcl
7269 8313 t
8313 11400 ay
11400 12950 dcl
12950 14360 dh
14360 14640 h#

would be converted by the above script to the following

0 1400625 sil
1400625 1696875 w
1696875 2725000 ow
2725000 3423750 vcl
3423750 3526875 b
3526875 3975000 iy
3975000 4543125 cl
4543125 5195625 t
5195625 7125000 ay
7125000 8093750 vcl
8093750 8975000 dh
8975000 9150000 sil

Notice that label boundaries in TIMIT format are given in terms of sample numbers (16kHz sample
rate), whereas the edited output file is in HTK format in which all times are in absolute 100ns
units.

As well as the Replace command, there is also a Merge command ME. This command is used to
replace a sequence of labels by a single label. For example, the following commands would merge
the closure and release labels in the previous TIMIT transcription into single labels

ME b bcl b
ME d dcl dh
ME t tcl t

As shown by this example, the label used for the merged sequence can be the same as occurs in the
original but some care is needed since HLEd commands are normally applied in sequence. Thus, a
command on line n is applied to the label sequence that remains after the commands on lines 1 to
n - 1 have been applied.

There is one exception to the above rule of sequential edit command application. The Change
command CH provides for context sensitive replacement. However, when a sequence of Change
commands occur in a script, the sequence is applied as a block so that the contexts which apply for
each command are those that existed just prior to the block being executed. The Change command
takes 4 arguments X A Y B such that every occurrence of label Y in the context of A_B is changed
to the label X. The contexts A and B refer to sets of labels and are defined by separate Define Context
commands DC. The CH and DC commands are primarily used for creating context sensitive labels. For
example, suppose that a set of context-dependent phoneme models are needed for TIMIT. Rather
than treat all possible contexts separately and build separate triphones for each (see below), the
possible contexts will be grouped into just 5 broad classes: C (consonant), V (vowel), N (nasal),
L (liquid) and S (silence). The goal then is to translate a label sequence such as sil b ah t iy
n ... into sil+C S-b+V C-ah+C V-t+V C-iy+N V-n+ ... where the - and + symbols within a
label are recognised by HTK as defining the left and right context, respectively. To perform this
transformation, it is necessary to firstly use DC commands to define the 5 contexts, that is

DC V iy ah ae eh ix ... 
DC C t k d k g dh ... 
DC L l r w j ...
DC N n m ng ... 
DC S h# #h epi ... 

Having defined the required contexts, a change command must be written for each context
dependent triphone, that is

CH V-ah+V V ah V
CH V-ah+C V ah C
CH V-ah+N V ah N
CH V-ah+L V ah L
etc.

This script will, of course, be rather long (25 x number of phonemes) but it can easily be generated
automatically by a simple program or shell script.
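As an illustration of the "simple program" mentioned above, the following Python sketch emits one CH command for every combination of left class, phoneme and right class. The function name and the short phoneme list are hypothetical; the five class names are those defined by the DC commands.

```python
from itertools import product

CONTEXT_CLASSES = ["C", "V", "N", "L", "S"]


def change_commands(phonemes, classes=CONTEXT_CLASSES):
    """Return one 'CH X A Y B' line per (phoneme, left class, right class)."""
    lines = []
    for phone in phonemes:
        for left, right in product(classes, repeat=2):
            # New label left-phone+right replaces phone in context left _ right
            lines.append(f"CH {left}-{phone}+{right} {left} {phone} {right}")
    return lines


script = change_commands(["ah", "iy"])
print(len(script))
print("\n".join(script[:3]))
```

With 5 classes on each side this yields 25 commands per phoneme, matching the count given in the text.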

The previous example shows how to transform a set of phonemes into a context dependent set in
which the contexts are user-defined. For convenience, HLEd provides a set of global transformation
commands for converting phonemic transcriptions to conventional left or right biphones, or full
triphones. For example, a script containing the single Triphone Conversion command TC will
convert phoneme files to regular triphones. As an illustration, applying the TC command to a
file containing the sequence sil b ah t iy n ... would give the transformed sequence sil+b
sil-b+ah b-ah+t ah-t+iy t-iy+n iy-n+ ... . Notice that the first and last phonemes in the
sequence cannot be transformed in the normal way. Hence, the left-most and right-most contexts
of these start and end phonemes can be specified explicitly as arguments to the TC commands if
required. For example, the command TC # # would give the sequence #-sil+b sil-b+ah b-ah+t
ah-t+iy t-iy+n iy-n+ ... +#. Also, the contexts at pauses and word boundaries can be blocked
using the WB command. For example, if WB sp was executed, the effect of a subsequent TC command
on the sequence sil b ah t sp iy n ... would be to give the sequence sil+b sil-b+ah b-ah+t
ah-t sp iy+n iy-n+ ... , where sp represents a short pause. Conversely, the NB command can
be used to ignore a label as far as context is concerned. For example, if NB sp was executed, the
effect of a subsequent TC command on the sequence sil b ah t sp iy n ... would be to give
the sequence sil+b sil-b+ah b-ah+t ah-t+iy sp t-iy+n iy-n+ ... .
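The TC and NB behaviour described above can be sketched as follows. This is an illustrative reimplementation, not HLEd's code: the function and parameter names are hypothetical, WB is not modelled, and since the example sequences in the text are truncated, the treatment of the final phoneme (no right context unless a boundary symbol is given) is an assumption made by symmetry with the first one.

```python
def to_triphones(phones, left=None, right=None, no_context=()):
    """Convert a phoneme list to triphone names.

    left, right -- optional boundary contexts, as in 'TC # #'
    no_context  -- labels that are skipped when forming contexts, as with NB
    """
    # Positions of the labels that take part in context formation.
    idx = [i for i, p in enumerate(phones) if p not in no_context]
    out = list(phones)
    for k, i in enumerate(idx):
        lctx = phones[idx[k - 1]] if k > 0 else left
        rctx = phones[idx[k + 1]] if k + 1 < len(idx) else right
        name = phones[i]
        if lctx is not None:
            name = f"{lctx}-{name}"
        if rctx is not None:
            name = f"{name}+{rctx}"
        out[i] = name
    return out


print(to_triphones(["sil", "b", "ah", "t", "iy", "n"]))
print(to_triphones(["sil", "b", "ah", "t", "sp", "iy", "n"], no_context={"sp"}))
```

Passing `left="#"` and `right="#"` corresponds to the `TC # #` form, giving `#-sil+b` at the start and a `+#` suffix on the last label.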

When processing HTK format label files with multiple levels, only the level 1 (i.e. left-most)
labels are affected. To process a higher level, the Move Level command ML should be used. For
example, in the script

ML 2
RE one 1
RE two 2

the Replace commands are applied to level 2 which is the first level above the basic level. The
command ML 1 returns to the base level. A complete level can be deleted by the Delete Level
command DL. This command can be given a numeric argument to delete a specific level or with no
argument, the current level is deleted. Multiple levels can also be split into single level alternatives
by using the Split Level command SL.

When processing HTK format files with multiple alternatives, each alternative is processed as
though it were a separate file.



Remember also that in addition to the explicit HLEd commands, levels and alternatives can be
filtered on input by setting the configuration variables TRANSLEV and TRANSALT (see section 6.1).

Finally, it should be noted that most HTK tools require all HMMs used in a system to be defined 
in a HMM List. HLEd can be made to automatically generate such a list as a by-product of editing 
the label files by using the -n option. For example, the following command would apply the script 
timit.led to all files in the directory tlabs, write the converted files to the directory hlabs and 
also write out a list of all new labels in the edited files to tlist. 

HLEd -n tlist -l hlabs -G TIMIT timit.led tlabs/*

Notice here that the -G option is used to inform HLEd that the format of the source files is TIMIT. 
This could also be indicated by setting the configuration variable SOURCELABEL. 
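When the source format is TIMIT, the sample-number boundaries have to be expressed in HTK's 100ns units, as noted for the label file example earlier in this section. The arithmetic can be checked with a few lines of Python; the function name is hypothetical and the 16kHz default is the TIMIT rate quoted in the text.

```python
HTK_UNITS_PER_SECOND = 10_000_000  # one HTK time unit is 100 ns


def samples_to_htk(sample_index, sample_rate_hz=16_000):
    """Convert a sample number to absolute time in 100 ns units."""
    return sample_index * HTK_UNITS_PER_SECOND // sample_rate_hz


for start, end, name in [(0, 2241, "h#"), (2241, 2715, "w"), (2715, 4360, "ow")]:
    print(samples_to_htk(start), samples_to_htk(end), name)
```

At 16kHz one sample lasts 62.5 microseconds, i.e. 625 HTK units, so sample 2241 becomes 1400625, in agreement with the converted label file.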
6.5 Summary

Table 6.1 lists all of the configuration parameters recognised by HLabel along with a brief
description. A missing module name means that it is recognised by more than one module.

Module   Name             Description
HLabel   LABELSQUOTE      Specify label quote character
HLabel   SOURCELABEL      Source label format
HLabel   SOURCERATE       Sample period for SCRIBE format
HLabel   STRIPTRIPHONES   Remove triphone contexts on input
HLabel   TARGETLABEL      Target label format
HLabel   TRANSALT         Filter alternatives on input
HLabel   TRANSLEV         Filter levels on input
HLabel   V1COMPAT         Version 1.5 compatibility mode
         TRACE            Trace control (default=0)

Table 6.1 Configuration Parameters used with Labels
HMM Definition Files






The principal function of HTK is to manipulate sets of hidden Markov models (HMMs). The
definition of a HMM must specify the model topology, the transition parameters and the output
distribution parameters. The HMM observation vectors can be divided into multiple independent
data streams and each stream can have its own weight. In addition, a HMM can have ancillary
information such as duration parameters. HTK supports both continuous mixture densities and
discrete distributions. HTK also provides a generalised tying mechanism which allows parameters
to be shared within and between models.

In order to encompass this rich variety of HMM types within a single framework, HTK uses
a formal language to define HMMs. The interpretation of this language is handled by the library
module HModel which is responsible for converting between the external and internal
representations of HMMs. In addition, it provides all the basic probability function calculations. A second
module HUtil provides various additional facilities for manipulating HMMs once they have been
loaded into memory.

The purpose of this chapter is to describe the HMM definition language in some detail. The
chapter begins by describing how to write individual HMM definitions. HTK macros are then
explained and the mechanisms for defining a complete model set are presented. The various flavours
of HMM are then described and the use of binary files discussed. Finally, a formal description of
the HTK HMM definition language is given.

As will be seen, the definition of a large HMM system can involve considerable complexity.
However, in practice, HMM systems are built incrementally. The usual starting point is a single
HMM definition which is then repeatedly cloned and refined using the various HTK tools (in
particular, HERest and HHEd). Hence, in practice, the HTK user rarely has to generate complex
HMM definition files directly.
7.1 The HMM Parameters

A HMM consists of a number of states. Each state j has an associated observation probability
distribution $b_j(o_t)$ which determines the probability of generating observation $o_t$ at time t and each
pair of states i and j has an associated transition probability $a_{ij}$. In HTK the entry state 1 and
the exit state N of an N state HMM are non-emitting.




Fig. 7.1 Simple Left-Right HMM

Fig. 7.1 shows a simple left-right HMM with five states in total. Three of these are emitting
states and have output probability distributions associated with them. The transition matrix for
this model will have 5 rows and 5 columns. Each row will sum to one except for the final row which
is always all zero since no transitions are allowed out of the final state.

HTK is principally concerned with continuous density models in which each observation
probability distribution is represented by a mixture Gaussian density. In this case, for state j the
probability $b_j(o_t)$ of generating observation $o_t$ is given by

$$b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_{js}} c_{jsm} \, \mathcal{N}(o_{st}; \mu_{jsm}, \Sigma_{jsm}) \right]^{\gamma_s} \tag{7.1}$$

where $M_{js}$ is the number of mixture components in state j for stream s, $c_{jsm}$ is the weight of
the m'th component and $\mathcal{N}(\cdot; \mu, \Sigma)$ is a multivariate Gaussian with mean vector $\mu$ and covariance
matrix $\Sigma$, that is

$$\mathcal{N}(o; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \, e^{-\frac{1}{2}(o-\mu)'\Sigma^{-1}(o-\mu)} \tag{7.2}$$

where n is the dimensionality of o. The exponent $\gamma_s$ is a stream weight and its default value is one.
Other values can be used to emphasise particular streams, however, none of the standard HTK
tools manipulate it.
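The following Python sketch evaluates the output probability of equation 7.1 for the common case of diagonal covariances, where the determinant in the Gaussian is simply the product of the variances. It is a direct, unoptimised transcription for illustration only; the function names and the data layout (one list of `(weight, mean, variance)` tuples per stream) are assumptions, and a practical implementation would normally work with log probabilities to avoid underflow.

```python
import math


def diag_gaussian(o, mean, var):
    """Multivariate Gaussian density with a diagonal covariance matrix."""
    n = len(o)
    det = math.prod(var)  # determinant of a diagonal matrix
    expo = -0.5 * sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return math.exp(expo) / math.sqrt((2 * math.pi) ** n * det)


def output_prob(streams, obs, stream_weights=None):
    """b_j(o_t): product over streams of the weighted mixture sum.

    streams        -- per stream, a list of (c, mean, var) mixture components
    obs            -- per stream, the observation vector for that stream
    stream_weights -- per stream exponent gamma_s (defaults to one)
    """
    if stream_weights is None:
        stream_weights = [1.0] * len(streams)
    b = 1.0
    for mixes, o, gamma in zip(streams, obs, stream_weights):
        mix = sum(c * diag_gaussian(o, mu, var) for c, mu, var in mixes)
        b *= mix ** gamma
    return b


state = [[(0.4, [0.0, 0.0], [1.0, 1.0]), (0.6, [0.0, 0.0], [1.0, 1.0])]]
print(output_prob(state, [[0.0, 0.0]]))
```

Because both components in this toy state are identical and the weights sum to one, the result equals a single two-dimensional unit Gaussian evaluated at its mean, 1/(2π).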



HTK also supports discrete probability distributions in which case

$$b_j(o_t) = \prod_{s=1}^{S} \left\{ P_{js}[v_s(o_{st})] \right\}^{\gamma_s} \tag{7.3}$$

where $v_s(o_{st})$ is the output of the vector quantiser for stream s given input vector $o_{st}$ and $P_{js}[v]$ is
the probability of state j generating symbol v in stream s.

In addition to the above, any model or state can have an associated vector of duration parameters
$\{d_k\}$¹. Also, it is necessary to specify the kind of the observation vectors, and the width of the
observation vector in each stream. Thus, the total information needed to define a single HMM is
as follows

• type of observation vector
• number and width of each data stream
• optional model duration parameter vector
• number of states
• for each emitting state and each stream
  – mixture component weights or discrete probabilities
  – if continuous density, then means and covariances
  – optional stream weight vector
  – optional duration parameter vector
• transition matrix

The following sections explain how these are defined.

¹ No current HTK tool can estimate or use these.



7.2 Basic HMM Definitions

Some HTK tools require a single HMM to be defined. For example, the isolated-unit re-estimation
tool HRest would be invoked as

HRest hmmdef s1 s2 s3

This would cause the model defined in the file hmmdef to be input and its parameters re-estimated
using the speech data files s1, s2, etc.



~h "hmm1"
<BeginHMM>
  <VecSize> 4 <MFCC>
  <NumStates> 5
  <State> 2
    <Mean> 4
      0.2 0.1 0.1 0.9
    <Variance> 4
      1.0 1.0 1.0 1.0
  <State> 3
    <Mean> 4
      0.4 0.9 0.2 0.1
    <Variance> 4
      1.0 2.0 2.0 0.5
  <State> 4
    <Mean> 4
      1.2 3.1 0.5 0.9
    <Variance> 4
      5.0 5.0 5.0 5.0
  <TransP> 5
    0.0 0.5 0.5 0.0 0.0
    0.0 0.4 0.4 0.2 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>

Fig. 7.2 Definition for Simple L-R HMM



HMM definition files consist of a sequence of symbols representing the elements of a simple 
language. These symbols are mainly keywords written within angle brackets and integer and floating 
point numbers. The full HTK definition language is presented more formally later in section 7.10. 
For now, the main features of the language will be described by some examples. 
Fig 7.2 shows a HMM definition corresponding to the simple left-right HMM illustrated in
Fig 7.1. It is a continuous density HMM with 5 states in total, 3 of which are emitting. The first
symbol in the file ~h indicates that the following string is the name of a macro of type h which
means that it is a HMM definition (macros are explained in detail later). Thus, this definition
describes a HMM called "hmm1". Note that HMM names should be composed of alphanumeric
characters only and must not consist solely of numbers. The HMM definition itself is bracketed by
the symbols <BeginHMM> and <EndHMM>.

The first line of the definition proper specifies the global features of the HMM. In any system
consisting of many HMMs, these features will be the same for all of them. In this case, the global
definitions indicate that the observation vectors have 4 components (<VecSize> 4) and that they
denote MFCC coefficients (<MFCC>).

The next line specifies the number of states in the HMM. There then follows a definition for each
emitting state j, each of which has a single mean vector $\mu_j$ introduced by the keyword <Mean>
and a diagonal variance vector $\Sigma_j$ introduced by the keyword <Variance>. The definition ends
with the transition matrix introduced by the keyword <TransP>.

Notice that the dimension of each vector or matrix is specified explicitly before listing the
component values. These dimensions must be consistent with the corresponding observation width (in
the case of output distribution parameters) or number of states (in the case of transition matrices).
Although in this example they could be inferred, HTK requires that they are included explicitly
since, as will be described shortly, they can be detached from the HMM definition and stored
elsewhere as a macro.

The definition for hmm1 makes use of many defaults. In particular, there is no definition for the
number of input data streams or for the number of mixture components per output distribution.
Hence, in both cases, a default of 1 is assumed.

Fig 7.3 shows a HMM definition in which the emitting states are 2 component mixture Gaussians.
The number of mixture components in each state j is indicated by the keyword <NumMixes> and
each mixture component is prefixed by the keyword <Mixture> followed by the component index m
and component weight $c_{jm}$. Note that there is no requirement for the number of mixture components
to be the same in each distribution.

State definitions and the mixture components within them may be listed in any order. When
a HMM definition is loaded, a check is made that all the required components have been defined.
In addition, checks are made that the mixture component weights and each row of the transition
matrix sum to one. If very rapid loading is required, this consistency checking can be inhibited by
setting the Boolean configuration variable CHKHMMDEFS to false.
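The kind of consistency checking described above can be illustrated with a short Python sketch. It is not HTK's own check: the function names and the tolerance are assumptions. The final row of the transition matrix is treated separately since, as noted in section 7.1, it is always all zero.

```python
import math


def check_transitions(trans, tol=1e-5):
    """True if trans is square, each row but the last sums to one,
    and the last row (exit state) is all zero."""
    n = len(trans)
    if any(len(row) != n for row in trans):
        return False
    for row in trans[:-1]:
        if not math.isclose(sum(row), 1.0, abs_tol=tol):
            return False
    return all(x == 0.0 for x in trans[-1])


def check_mixture_weights(weights, tol=1e-5):
    """True if the mixture component weights of one distribution sum to one."""
    return math.isclose(sum(weights), 1.0, abs_tol=tol)


# Transition matrix taken from the <TransP> block of Fig 7.2.
hmm1_trans = [
    [0.0, 0.5, 0.5, 0.0, 0.0],
    [0.0, 0.4, 0.4, 0.2, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(check_transitions(hmm1_trans), check_mixture_weights([0.4, 0.6]))
```

A small tolerance is used because the values in a definition file are decimal approximations and rarely sum to exactly one in floating point.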

As an alternative to diagonal variance vectors, a Gaussian distribution can have a full rank
covariance matrix. An example of this is shown in the definition for hmm3 shown in Fig 7.4. Since
covariance matrices are symmetric, they are stored in upper triangular form i.e. each row of the
matrix starts at the diagonal element². Also, covariance matrices are stored in their inverse form
i.e. HMM definitions contain $\Sigma^{-1}$ rather than $\Sigma$. To reflect this, the keyword chosen to introduce
a full covariance matrix is <InvCovar>.
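To make the upper triangular storage concrete, the sketch below expands rows that each start at the diagonal element into a full symmetric matrix. The function name is hypothetical and the numbers are illustrative values for a 4 x 4 inverse covariance matrix.

```python
def expand_upper_triangular(rows):
    """Rebuild a full symmetric matrix from rows starting at the diagonal."""
    n = len(rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(rows):
        if len(row) != n - i:
            raise ValueError(f"row {i} should have {n - i} values")
        for k, value in enumerate(row):
            j = i + k
            full[i][j] = value
            full[j][i] = value  # mirror across the diagonal
    return full


rows = [
    [1.0, 0.1, 0.0, 0.0],
    [1.0, 0.2, 0.0],
    [1.0, 0.1],
    [1.0],
]
for line in expand_upper_triangular(rows):
    print(line)
```

Only n(n+1)/2 values are stored for an n x n matrix, 10 rather than 16 in this case.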

² Covariance matrices are actually stored internally in lower triangular form.
~h "hmm2"
<BeginHMM>
  <VecSize> 4 <MFCC>
  <NumStates> 4
  <State> 2 <NumMixes> 2
    <Mixture> 1 0.4
      <Mean> 4
        0.3 0.2 0.2 1.0
      <Variance> 4
        1.0 1.0 1.0 1.0
    <Mixture> 2 0.6
      <Mean> 4
        0.1 0.0 0.0 0.8
      <Variance> 4
        1.0 1.0 1.0 1.0
  <State> 3 <NumMixes> 2
    <Mixture> 1 0.7
      <Mean> 4
        0.1 0.2 0.6 1.4
      <Variance> 4
        1.0 1.0 1.0 1.0
    <Mixture> 2 0.3
      <Mean> 4
        2.1 0.0 1.0 1.8
      <Variance> 4
        1.0 1.0 1.0 1.0
  <TransP> 4
    0.0 1.0 0.0 0.0
    0.0 0.5 0.5 0.0
    0.0 0.0 0.6 0.4
    0.0 0.0 0.0 0.0
<EndHMM>

Fig. 7.3 Simple Mixture Gaussian HMM





Notice that only the second state has a full covariance Gaussian component. The first state has
a mixture of two diagonal variance Gaussian components. Again, this illustrates the flexibility of
HMM definition in HTK. If required the structure of every Gaussian can be individually configured.

Another possible way to store covariance information is in the form of the Choleski decomposition
L of the inverse covariance matrix i.e. $\Sigma^{-1} = LL'$. Again this is stored externally in upper triangular
form so L' is actually stored. It is distinguished from the normal inverse covariance matrix by using
the keyword <LLTCovar> in place of <InvCovar>³.

The definition for hmm3 also illustrates another macro type, that is ~o. This macro is used
as an alternative way of specifying global options and, in fact, it is the format used by HTK tools
when they write out a HMM definition. It is provided so that global options can be specified ahead
of any other HMM parameters. As will be seen later, this is useful when using many types of macro.

As noted earlier, the observation vectors used to represent the speech signal can be divided into
two or more statistically independent data streams. This corresponds to the splitting-up of the
input speech vectors as described in section 5.13. In HMM definitions, the use of multiple data
streams must be indicated by specifying the number of streams and the width (i.e. dimension) of
each stream as a global option. This is done using the keyword <StreamInfo> followed by the
number of streams, and then a sequence of numbers indicating the width of each stream. The sum
of these stream widths must equal the original vector size as indicated by the <VecSize> keyword.



³ The Choleski storage format is not used by default in HTK Version 2.
~o <VecSize> 4 <MFCC>
~h "hmm3"
<BeginHMM>
  <NumStates> 4
  <State> 2 <NumMixes> 2
    <Mixture> 1 0.4
      <Mean> 4
        0.3 0.2 0.2 1.0
      <Variance> 4
        1.0 1.0 1.0 1.0
    <Mixture> 2 0.6
      <Mean> 4
        0.1 0.0 0.0 0.8
      <Variance> 4
        1.0 1.0 1.0 1.0
  <State> 3 <NumMixes> 1
    <Mean> 4
      0.1 0.2 0.6 1.4
    <InvCovar> 4
      1.0 0.1 0.0 0.0
          1.0 0.2 0.0
              1.0 0.1
                  1.0
  <TransP> 4
    0.0 1.0 0.0 0.0
    0.0 0.5 0.5 0.0
    0.0 0.0 0.6 0.4
    0.0 0.0 0.0 0.0
<EndHMM>

Fig. 7.4 HMM with Full Covariance

An example of a HMM definition for multiple data streams is hmm4 shown in Fig 7.5. This
HMM is intended to model 2 distinct streams, the first has 3 components and the second has 1.
This is indicated by the global option <StreamInfo> 2 3 1. The definition of each state output
distribution now includes means and variances for each individual stream.

Thus, in Fig 7.5, each state is subdivided into 2 streams using the <Stream> keyword followed
by the stream number. Note also, that each individual stream can be weighted. In state 2 of hmm4,
the vector following the <SWeights> keyword indicates that stream 1 has a weight of 0.9 whereas
stream 2 has a weight of 1.1. There is no stream weight vector in state 3 and hence the default
weight of 1.0 will be assigned to each stream.

No HTK tools are supplied for estimating optimal stream weight values. Hence, they must
either be set manually or derived from some outside source. However, once set, they are used in the
calculation of output probabilities as specified in equations 7.1 and 7.3, and hence they will affect
the operation of both the training and recognition tools.




~o <VecSize> 4 <MFCC>
   <StreamInfo> 2 3 1
~h "hmm4"
<BeginHMM>
  <NumStates> 4
  <State> 2
    <SWeights> 2 0.9 1.1
    <Stream> 1
      <Mean> 3
        0.2 0.1 0.1
      <Variance> 3
        1.0 1.0 1.0
    <Stream> 2
      <Mean> 1 0.0
      <Variance> 1 4.0
  <State> 3
    <Stream> 1
      <Mean> 3
        0.3 0.2 0.0
      <Variance> 3
        1.0 1.0 1.0
    <Stream> 2
      <Mean> 1 0.5
      <Variance> 1 3.0
  <TransP> 4
    0.0 1.0 0.0 0.0
    0.0 0.6 0.4 0.0
    0.0 0.0 0.4 0.6
    0.0 0.0 0.0 0.0
<EndHMM>

Fig. 7.5 HMM with 2 Data Streams


7.3 Macro Definitions

So far, basic model definitions have been described in which all of the information required to
define a HMM has been given directly between the <BeginHMM> and <EndHMM> keywords. As
an alternative, HTK allows the internal parts of a definition to be written as separate units, possibly
in several different files, and then referenced by name wherever they are needed. Such definitions
are called macros.



~o <VecSize> 4 <MFCC>
~v "var"
  <Variance> 4
    1.0 1.0 1.0 1.0

Fig. 7.6 Simple Macro Definitions



HMM (~h) and global option macros (~o) have already been described. In fact, these are
both rather special cases since neither is ever referenced explicitly by another definition. Indeed,
the option macro is unusual in that since it is global and must be unique, it has no name. As an
illustration of the use of macros, it may be observed that the variance vectors in the HMM definition
hmm2 given in Fig 7.3 are all identical. If this was intentional, then the variance vector could be
defined as a macro as illustrated in Fig 7.6.

A macro definition consists of a macro type indicator followed by a user-defined macro name.
In this case, the indicator is ~v and the name is var. Notice that a global options macro is included
before the definition for var. HTK must know these before it can process any other definitions; thus
the first macro file specified on the command line of any HTK tool must have the global options
macro. The global options macro need not be repeated at the head of every definition file, but it does
no harm to do so.



~h "hmm5"
<BeginHMM>
  <NumStates> 4
  <State> 2 <NumMixes> 2
    <Mixture> 1 0.4
      <Mean> 4
        0.3 0.2 0.2 1.0
      ~v "var"
    <Mixture> 2 0.6
      <Mean> 4
        0.1 0.0 0.0 0.8
      ~v "var"
  <State> 3 <NumMixes> 2
    <Mixture> 1 0.7
      <Mean> 4
        0.1 0.2 0.6 1.4
      ~v "var"
    <Mixture> 2 0.3
      <Mean> 4
        2.1 0.0 1.0 1.8
      ~v "var"
  <TransP> 4
    0.0 1.0 0.0 0.0
    0.0 0.5 0.5 0.0
    0.0 0.0 0.6 0.4
    0.0 0.0 0.0 0.0
<EndHMM>

Fig. 7.7 A Definition Using Macros
Once defined, a macro is used simply by writing the type indicator and name exactly as written
in the definition. Thus, for example, Fig 7.7 defines a HMM called hmm5 which uses the variance
macro var but is otherwise identical to the earlier HMM definition hmm2.

The definition for hmm5 can be understood by substituting the textual body of the var macro
everywhere that it is referenced. Textually this would make the definition for hmm5 identical to
that for hmm2, and indeed, if input to a recogniser, their effects would be similar. However, as will
become clear in later chapters, the HMM definitions hmm2 and hmm5 differ in two ways. Firstly,
if any attempt was made to re-estimate the parameters of hmm2, the values of the variance vectors
would almost certainly diverge. However, the variance vectors of hmm5 are tied together and are
guaranteed to remain identical, even after re-estimation. Thus, in general, the use of a macro
enforces a tying which results in the corresponding parameters being shared amongst all the HMM
structures which reference that macro. Secondly, when used in a recognition tool, the computation
required to decode using HMMs with tied parameters will often be reduced. This is particularly
true when higher level parts of a HMM definition are tied such as whole states.
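The difference between textual duplication (hmm2) and tying through a macro (hmm5) is essentially the difference between copies and shared references. The Python sketch below illustrates this idea only; the classes and function names are hypothetical and bear no relation to HTK's internal data structures.

```python
class Vector:
    """A parameter vector; sharing one instance models a macro such as ~v "var"."""

    def __init__(self, values):
        self.values = list(values)


class Mixture:
    def __init__(self, weight, mean, variance):
        self.weight, self.mean, self.variance = weight, mean, variance


def build(tied):
    """Four mixture components as in hmm2 (tied=False) or hmm5 (tied=True)."""
    shared = Vector([1.0, 1.0, 1.0, 1.0])
    means = [[0.3, 0.2, 0.2, 1.0], [0.1, 0.0, 0.0, 0.8],
             [0.1, 0.2, 0.6, 1.4], [2.1, 0.0, 1.0, 1.8]]
    weights = [0.4, 0.6, 0.7, 0.3]
    return [
        Mixture(w, Vector(m), shared if tied else Vector(shared.values))
        for w, m in zip(weights, means)
    ]


def distinct_variances(mixtures):
    """Number of separate variance vectors that would need re-estimating."""
    return len({id(m.variance) for m in mixtures})


tied, untied = build(True), build(False)
tied[0].variance.values[0] = 2.0    # an update to the tied vector ...
untied[0].variance.values[0] = 2.0  # ... versus an update to one copy
print(distinct_variances(tied), distinct_variances(untied))
```

After the update, every component of the tied model sees the new value, whereas in the untied model only the first component has changed. The smaller number of distinct parameter sets also hints at why tied systems can need less computation.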

There are many different macro types. Some have special meanings but the following correspond
to the various distinct points in the hierarchy of HMM parameters which can be tied.

~s   shared state distribution
~m   shared Gaussian mixture component
~u   shared mean vector
~v   shared diagonal variance vector
~i   shared inverse full covariance matrix
~c   shared choleski L' matrix
~x   shared arbitrary transform matrix
~t   shared transition matrix
~d   shared duration parameters
~w   shared stream weight vector

Fig 7.8 illustrates these potential tie points graphically for the case of continuous density HMMs.
In this figure, each solid black circle represents a potential tie point, and the associated macro type
is indicated alongside it.




Fig. 7.8 HMM Hierarchy and Potential Tie Points



The tie points for discrete HMMs are identical except that the macro types ~m, ~v, ~c, ~i and
~u are not relevant and are therefore excluded.

The macros with special meanings are as follows

~l   logical HMM             ~h   physical HMM
~o   global option values    ~p   tied mixture
~r   regression class tree   ~j   linear transform
The distinction between logical and physical HMMs will be explained in the next section and option
macros have already been described. The ~p macro is used by the HMM editor HHEd for building
tied mixture systems (see section 7.5). The ~l or ~p macros are special in the sense that they are
created implicitly in order to represent specific kinds of parameter sharing and they never occur
explicitly in HMM definitions.



7.4 HMM Sets 

The previous sections^ave described how a single HMM definition can be specified. However, many 
HTK tools require cojttniete model sets to be specified rather than just a single model. When this 
is the case, the indivi(I2eA HMMs which belong to the set are listed in a file rather than being 
enumerated explicitly orf^he command line. Thus, for example, a typical invocation of the tool 
HERest might be as follows 

    HERest ... -H mf1 -H mf2 ... hlist

where each -H option names a macro file and hlist contains a list of HMM names, one per line. 
For example, it might contain 



    ha
    hb
    hc



In a case such as this, the macro files would normally contain definitions for the models ha, hb and
hc, along with any lower level macro definitions that they might require.

As an illustration, Fig 7.9 and Fig 7.10 give examples of what the macro files mf1 and mf2 might
contain. The first file contains definitions for three states and a transition matrix. The second file
contains definitions for the three HMMs. In this example, each HMM shares the three states and
the common transition matrix. A HMM set such as this is called a tied-state system.

The order in which macro files are listed on the command line and the order of definition within
each file must ensure that all macro definitions are defined before they are referenced. Thus, macro
files are typically organised such that all low level structures come first followed by states and
transition matrices, with the actual HMM definitions coming last.

When the HMM list contains the name of a HMM for which no corresponding macro has been
defined, then an attempt is made to open a file with the same name. This file is expected to contain
a single definition corresponding to the required HMM. Thus, the general mechanism for loading
a set of HMMs is as shown in Fig 7.11. In this example, the HMM list hlist contains the names
of five HMMs of which only three have been predefined via the macro files. Hence, the remaining
definitions are found in individual HMM definition files hd and he.
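The lookup order described above can be illustrated with a minimal Python sketch. This is not HTK code: the function name resolve_definitions and the use of plain sets to stand in for loaded macro files and files on disk are assumptions made purely for illustration.

```python
def resolve_definitions(hmm_names, macro_defined, files_on_disk):
    """Report where each HMM definition would be found.

    hmm_names     -- names read from the HMM list, in order
    macro_defined -- names already defined by ~h macros in the macro files
    files_on_disk -- names for which an individual definition file exists
    (all three arguments are hypothetical stand-ins for HTK's internal state)
    """
    sources = {}
    for name in hmm_names:
        if name in macro_defined:
            # A macro definition takes priority.
            sources[name] = "macro"
        elif name in files_on_disk:
            # Otherwise fall back to a file with the same name as the HMM.
            sources[name] = "file"
        else:
            raise FileNotFoundError(f"no macro or definition file for HMM {name!r}")
    return sources


hlist = ["ha", "hb", "hc", "hd", "he"]
print(resolve_definitions(hlist, {"ha", "hb", "hc"}, {"hd", "he"}))
```

The sketch only records the source of each definition; a real loader would also parse the definitions themselves.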

When a large number of HMMs must be loaded from individual files, it is common to store
them in a specific directory. Most HTK tools allow this directory to be specified explicitly using a
command line option. For example, in the command

    HERest -d hdir ... hlist ...

the definitions for the HMMs listed in hlist will be searched for in the subdirectory hdir.

After loading each HMM set, HModel marks it as belonging to one of the following categories
(called the HSKind).

• PLAINHS 

• SHAREDHS 



• TIEDHS 

• DISCRETEHS 

Fig. 7.9 File mf1: shared state and transition matrix macros

Any HMM set containing discrete output distributions is assigned to the DISCRETEHS category
(see section 7.6). If all mixture components are tied, then it is assigned to the TIEDHS category
(see section 7.5). If it contains any shared states (~s macros) or Gaussians (~m macros) then it
is SHAREDHS. Otherwise, it is PLAINHS. The category assigned to a HMM set determines which of
several possible optimisations the various HTK tools can apply to it. As a check, the required kind
of a HMM set can also be set via the configuration variable HMMSETKIND. For debugging purposes,
this can also be used to re-categorise a SHAREDHS system as PLAINHS.
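The order in which these tests are applied can be summarised by the following Python sketch. It is an illustration of the rules as stated in the text only; the function name and the three boolean arguments are hypothetical and do not correspond to HModel's internal interface.

```python
def classify_hs_kind(has_discrete, all_mixtures_tied, has_shared_states_or_gaussians):
    """Return the HSKind label implied by the rules in the text."""
    if has_discrete:
        return "DISCRETEHS"      # any discrete output distribution
    if all_mixtures_tied:
        return "TIEDHS"          # every mixture component is tied
    if has_shared_states_or_gaussians:
        return "SHAREDHS"        # some ~s or ~m macros are used
    return "PLAINHS"             # nothing is shared


print(classify_hs_kind(False, False, True))
```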

As shown in Figure 7.8, complete HMM definitions can be tied as well as their individual
parameters. However, tying at the HMM level is defined in a different way. HMM lists have so far
been described as simply a list of model names. In fact, every HMM has two names: a logical name
and a physical name. The logical name reflects the role of the model and the physical name is used
to identify the definition on disk. By default, the logical and physical names are identical. HMM
tying is implemented by letting several logically distinct HMMs share the same physical definition.
This is done by giving an explicit physical name immediately after the logical name in a HMM list.



    ~h "ha"
    <BeginHMM>
      <NumStates> 5
      <State> 2
        ~s "stateA"
      <State> 3
        ~s "stateB"
      <State> 4
        ~s "stateB"
      ~t "tran"
    <EndHMM>

    ~h "hb"
    <BeginHMM>
      <NumStates> 5
      <State> 2
        ~s "stateB"
      <State> 3
        ~s "stateA"
      <State> 4
        ~s "stateC"
      ~t "tran"
    <EndHMM>

    ~h "hc"
    <BeginHMM>
      <NumStates> 5
      <State> 2
        ~s "stateC"
      <State> 3
        ~s "stateC"
      <State> 4
        ~s "stateB"
      ~t "tran"
    <EndHMM>

Fig. 7.10 Simple Tied-State System

Fig. 7.11 Defining a Model Set

For example, in the HMM list shown in Fig 7.12, the logical HMMs two, too and to are tied
and share the same physical HMM definition tuw. The HMMs one and won are also tied but in this
case won shares one's definition. There is, however, no subtle distinction here. The two different
cases are given just to emphasise that the names used for the logical and physical HMMs can be
the same or different, as is convenient. Finally, in this example, the models three and four are
untied.



    two     tuw
    too     tuw
    to      tuw
    one
    won     one
    three
    four

Fig. 7.12 HMM List with Tying



This mechanism is implemented internally by creating a ~l macro definition for every HMM in
the HMM list. If an explicit physical HMM is also given in the list, then the logical HMM is linked
to that macro, otherwise a ~h macro is created with the same name as the ~l macro. Notice that
this is one case where the "define before use" rule is relaxed. If an undefined ~h is encountered
then a dummy place-holder is created for it and, as explained above, HModel subsequently tries
to find a HMM definition file of the same name.
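The logical-to-physical mapping described above can be sketched in a few lines of Python. This is an illustration only, assuming a list file with one entry per line and whitespace separating the logical name from an optional physical name; the function name parse_hmm_list is hypothetical.

```python
def parse_hmm_list(lines):
    """Map each logical HMM name to its physical HMM name."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # ignore blank lines
        logical = fields[0]
        # With no explicit physical name, the two names are identical.
        physical = fields[1] if len(fields) > 1 else logical
        mapping[logical] = physical
    return mapping


hmm_list = ["two tuw", "too tuw", "to tuw", "one", "won one", "three", "four"]
mapping = parse_hmm_list(hmm_list)
print(mapping)
print(sorted(set(mapping.values())))      # the distinct physical definitions
```

For a list such as the one in Fig 7.12, seven logical models resolve to four physical definitions.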

Finally it should be noted that in earlier versions of HTK, there were no HMM macros. However,
HMM definitions could be listed in a single master macro file or MMF. Each HMM definition began
with its name written as a quoted string and ended with a period written on its own (just like master
label files), and the first line of an MMF contained the string #!MMF!#. In HTK 3.4, the use of
MMFs has been subsumed within the general macro definition facility using the ~h type. However,
for compatibility, the older MMF style of file can still be read by all HTK tools.



7.5 Tied-Mixture Systems

A Tied-Mixture System is one in which all Gaussian components are stored in a pool and all state
output distributions share this pool. Fig 7.13 illustrates this for the case of a single data stream.

Fig. 7.13 Tied Mixture System



Each state output distribution is defined by M mixture component weights and since all states
share the same components, all of the state-specific discrimination is encapsulated within these
weights. The set of Gaussian components selected for the pool should be representative of the
acoustic space covered by the feature vectors. To keep M manageable, multiple data streams are
typically used with tied-mixture systems. For example, static parameters may be in one stream and
delta parameters in another (see section 5.13). Each stream then has a separate pool of Gaussians
which are often referred to as codebooks.

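More formally, for S independent data streams, the output distribution for state j is defined as

\[
b_j(o_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} \, \mathcal{N}(o_{st}; \mu_{sm}, \Sigma_{sm}) \right]^{\gamma_s}
\tag{7.4}
\]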



where the notation is identical to that used in equation 7.1. Note however that this equation
differs from equation 7.1 in that the Gaussian component parameters and the number of mixture
components per stream are state independent.

Tied-mixture systems lack the modelling accuracy of fully continuous density systems. However,
they can often be implemented more efficiently since the total number of Gaussians which must be
evaluated at each input frame is independent of the number of active HMM states and is typically
much smaller.

A tied-mixture HMM system in HTK is defined by representing the pool of shared Gaussians
as ~m macros with names "xxx1", "xxx2", ..., "xxxM" where "xxx" is an arbitrary name. Each
HMM state definition is then specified by giving the name "xxx" followed by a list of the mixture
weights. Multiple streams are identified using the <Stream> keyword as described previously.

As an example, Fig 7.14 shows a set of macro definitions which specify a 5 Gaussian component
tied-mixture pool.

Fig 7.17 then shows a typical tied-mixture HMM definition which uses this pool. As can be
seen, the mixture component weights are represented as an array of real numbers as in the continuous
density case.

The number of components in each tied-mixture codebook is typically of the order of 2 or 3
hundred. Hence, the list of mixture weights in each state is often long with many values being
repeated, particularly floor values. To allow more efficient coding, successive identical values can
be represented as a single value plus a repeat count in the form of an asterisk followed by an integer
multiplier. For example, Fig 7.15 shows the same HMM definition as above but using repeat counts.
When HTK writes out a tied-mixture definition, it uses repeat counts wherever possible.
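The repeat-count notation amounts to a simple run-length code. The following Python sketch shows one way to expand and to produce such a list; it is an illustration rather than HTK's own routine, and the function names and the exact textual formatting of the weights are assumptions.

```python
def expand_weights(tokens):
    """Expand tokens such as '0.3*2' into a flat list of weights."""
    weights = []
    for token in tokens:
        value, _, count = token.partition("*")
        repeats = int(count) if count else 1
        weights.extend([float(value)] * repeats)
    return weights


def compress_weights(weights):
    """Replace each run of identical successive weights by 'value*count'."""
    tokens = []
    i = 0
    while i < len(weights):
        j = i
        while j < len(weights) and weights[j] == weights[i]:
            j += 1
        run = j - i
        tokens.append(f"{weights[i]}*{run}" if run > 1 else f"{weights[i]}")
        i = j
    return tokens


print(expand_weights("0.2 0.1 0.3*2 0.1".split()))
print(compress_weights([0.4, 0.3, 0.1, 0.1, 0.1]))
```

Runs are detected by exact equality of successive values, which suits repeated floor values that are stored identically.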



7.6 Discrete Probability HMMs 

Discrete probability HMMs model observation sequences which consist of symbols drawn from a
discrete and finite set of size M. As in the case of tied-mixture systems described above, this set is
often referred to as a codebook.

The form of the output distributions in a discrete HMM was given in equation 7.3. It consists of
a table giving the probability of each possible observation symbol. Each symbol is identified by an
index in the range 1 to M and hence the probability of any symbol can be determined by a simple
table look-up operation.

Fig. 7.14 Tied-Mixture Codebook

    ~h "htm"
    <BeginHMM>
      <NumStates> 4
      <State> 2 <NumMixes> 5
        <TMix> mix 0.2 0.1 0.3*2 0.1
      <State> 3 <NumMixes> 5
        <TMix> mix 0.4 0.3 0.1*3
      <TransP> 4
        ...
    <EndHMM>

Fig. 7.15 HMM using Repeat Counts

For speech applications, the observation symbols are generated by a vector quantiser which
typically associates a prototype speech vector with each codebook symbol. Each incoming speech
vector is then represented by the symbol whose associated prototype is closest. The prototypes
themselves are chosen to cover the acoustic space and they are usually calculated by clustering a
representative sample of speech vectors.
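The quantisation step can be sketched as a nearest-prototype search. The Python below is a minimal illustration and not HTK's quantiser: it assumes a squared Euclidean distance, a flat (non-tree) codebook and 1-based symbol indices, and the function name quantise is hypothetical.

```python
def quantise(vector, prototypes):
    """Return the 1-based index of the prototype closest to vector."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(range(len(prototypes)),
               key=lambda i: squared_distance(vector, prototypes[i]))
    return best + 1   # symbols are numbered from 1 to M


# A toy codebook of three 2-dimensional prototypes.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
print(quantise([0.9, 1.2], codebook))
print(quantise([3.0, 5.0], codebook))
```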

In HTK, discrete HMMs are specified using a very similar notation to that used for tied-mixture
HMMs. A discrete HMM can have multiple data streams but the width of each stream must be
1. The output probabilities are stored as logs in a scaled integer format such that if $d_{js}[v]$ is the
stored discrete probability for symbol v in stream s of state j, the true probability is given by

\[
P_{js}[v] = \exp(-d_{js}[v] / 2371.8)
\tag{7.5}
\]

Storage in the form of scaled logs allows discrete probability HMMs to be implemented very
efficiently since HTK tools mostly use log arithmetic and direct storage in log form avoids the need
for a run-time conversion. The range determined by the constant 2371.8 was selected to enable
probabilities from 1.0 down to 0.000001 to be stored.
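As a worked illustration of equation 7.5, the following Python sketch converts between stored values and probabilities. The function names are hypothetical, and the rounding and clamping used in the reverse direction are assumptions rather than a description of HTK's own code.

```python
import math

SCALE = 2371.8        # scaling constant from equation 7.5
MAX_STORED = 32767    # largest value that fits in 15 bits


def stored_to_prob(d):
    """Equation 7.5: recover the probability from a stored scaled log."""
    return math.exp(-d / SCALE)


def prob_to_stored(p):
    """Inverse mapping, rounded to an integer and clamped to the 15-bit range."""
    if p <= 0.0:
        return MAX_STORED
    return min(MAX_STORED, round(-math.log(p) * SCALE))


for d in (0, 1644, 3288, 5461, 32767):
    print(d, stored_to_prob(d))
```

Under this mapping the stored values 1644, 3288 and 5461 correspond to probabilities of approximately 0.5, 0.25 and 0.1 respectively, and 32767 corresponds to approximately 0.000001.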



As an example, Fig 7.18 shows the definition of a discrete HMM called dhmm1. As can be seen,
this has two streams. The codebook for stream 1 is size 10 and for stream 2, it is size 2. For
consistency with the representation used for continuous density HMMs, these sizes are encoded in
the <NumMixes> specifier.

Fig. 7.16 Input Linear Transform

7.7 Input Linear Transforms 

When reading feature vectors from files, HTK will coerce them to the TARGETKIND specified in
the config file. Often the TARGETKIND will contain certain qualifiers (specifying for example delta
parameters). In addition to this parameter coercion it is possible to apply a linear transform before,
or after, appending delta, acceleration and third derivative parameters.

Figure 7.16 shows an example linear transform. The <PreQual> keyword specifies that the linear
transform is to be applied before the delta and delta-delta parameters specified in TARGETKIND are
added. The default mode, no <PreQual> keyword, applies the linear transform after the addition
of the qualifiers.

The linear transform fully supports projection from a higher number of features to a smaller
number of features. In the example, the parameterised data must consist of 5 MFCC parameters*.
The model sets that are generated using this transform have a vector size of 2.
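A projection of this kind is simply a matrix-vector product with fewer rows than columns. The Python sketch below illustrates the arithmetic for a 2 by 5 matrix; the matrix values are invented for the example and are not those of Fig 7.16, and the function name is hypothetical.

```python
def apply_linear_transform(matrix, vector):
    """Multiply a rows-by-columns matrix by a feature vector."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("matrix width must equal the input vector size")
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


# Illustrative 2 x 5 projection: keep only the first two coefficients.
projection = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
]
features = [1.0, 2.0, 3.0, 4.0, 5.0]   # five input parameters
print(apply_linear_transform(projection, features))
```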

By default the linear transform is stored with the HMM. This is achieved by adding the
<InputXform> keyword and specifying the transform or macroname. To allow compatibility with
tools only supporting the old format models it is possible to specify that no linear transform is to
be stored with the model.



    # Do not store linear transform
    HMODEL: SAVEINPUTXFORM = FALSE



In addition it is possible to specify the linear transform as a HPARM configuration variable,
MATTRANFN.

    # Specifying an input linear transform
    HPARM: MATTRANFN = /home/test/lintran.mat

When a linear transform is specified in this form it is not necessary to have a macroname linked
with it. In this case the filename will be used as the macroname (having stripped the directory
name).



7.8 Tee Models 

Normally, the transition probability from the non-emitting entry state to the non-emitting exit 
state of a HMM will be zero to ensure that the HMM aligns with at least one observation vector. 
Models which have a non-zero entry to exit transition probability are referred to as tee-models. 

Tee-models are useful for modelling optional transient effects such as short pauses and noise 
bursts, particularly between words. 



* If C0 or normalised log-energy are added, these will be stripped prior to applying the linear transform.

    ~h "htm"
    <BeginHMM>
      <NumStates> 4
      <State> 2 <NumMixes> 5
        <TMix> mix 0.2 0.1 0.3 0.3 0.1
      <State> 3 <NumMixes> 5
        <TMix> mix 0.4 0.3 0.1 0.1 0.1
      <TransP> 4
        0.0 1.0 0.0 0.0
        0.0 0.5 0.5 0.0
        0.0 0.0 0.6 0.4
        0.0 0.0 0.0 0.0
    <EndHMM>

Fig. 7.17 Tied-Mixture HMM



Although most HTK tools support tee-models, they are incompatible with those that work with
isolated models such as HInit and HRest. When a tee-model is loaded into one of these tools,
its entry to exit transition probability is reset to zero and the first row of its transition matrix is
renormalised.
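The reset described above can be illustrated with a short Python sketch. It assumes that renormalising means dividing the remaining entries of the first row by their sum so that the row again sums to one; the function name is hypothetical and the code is not taken from HTK.

```python
def remove_tee_transition(trans):
    """Zero the entry-to-exit probability and renormalise the first row."""
    rows = [list(row) for row in trans]   # work on a copy
    first = rows[0]
    first[-1] = 0.0                       # entry state -> exit state
    total = sum(first)
    if total > 0.0:
        rows[0] = [p / total for p in first]
    return rows


tee_model = [
    [0.0, 0.8, 0.0, 0.2],   # 0.2 is the entry-to-exit (tee) transition
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.6, 0.4],
    [0.0, 0.0, 0.0, 0.0],
]
print(remove_tee_transition(tee_model)[0])
```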




7.9 Binary Storage Format



Throughout this chapter, a text-based representation has been used for the external storage of
HMM definitions. For experimental work, text-based storage allows simple and direct access to
HMM parameters and this can be invaluable. However, when using very large HMM sets, storage
in text form is less practical since it is inefficient in its use of memory and the time taken to load
can be excessive due to the large number of character to float conversions needed.

To solve these problems, HTK also provides a binary storage format. In binary mode, keywords
are written as a single colon followed by an 8 bit code representing the actual keyword. Any
subsequent numerical information following the keyword is then in binary. Integers are written as
16-bit shorts and all floating-point numbers are written as 32-bit single precision floats. The repeat
factor used in the run-length encoding scheme for tied-mixture and discrete HMMs is written as a
single byte. Its presence immediately after a 16-bit discrete log probability is indicated by setting
the top bit to 1 (this is the reason why the range of discrete log probabilities is limited to 0 to 32767
i.e. only 15 bits are used for the actual value). For tied-mixtures, the repeat count is signalled by
subtracting 2.0 from the weight.
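The use of the top bit as a flag can be illustrated as follows. This Python sketch shows only the bit manipulation on a 16-bit quantity; it does not read or write HTK binary files, it makes no assumption about byte order, and the function names are hypothetical.

```python
REPEAT_FLAG = 0x8000   # top bit of a 16-bit word
VALUE_MASK = 0x7FFF    # remaining 15 bits hold the scaled log probability


def pack_dprob(value, has_repeat):
    """Combine a 15-bit value with the repeat-count flag."""
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("discrete log probability must lie in 0..32767")
    return value | (REPEAT_FLAG if has_repeat else 0)


def unpack_dprob(word):
    """Split a 16-bit word into (value, has_repeat)."""
    return word & VALUE_MASK, bool(word & REPEAT_FLAG)


word = pack_dprob(3288, True)
print(hex(word), unpack_dprob(word))
```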

Binary storage format and text storage format can be mixed within and between input files. Each
time a keyword is encountered, its coding is used to determine whether the subsequent numerical
information should be input in text or binary form. This means, for example, that binary files can
be manually patched by replacing a binary-format definition by a text format definition*.

HTK tools provide a standard command line option (-B) to indicate that HMM definitions
should be output in binary format. Alternatively, the Boolean configuration variable SAVEBINARY
can be set to true to force binary format output.



7.10 The HMM Definition Language 



To conclude this chapter, this section presents a formal description of the HMM definition language
used by HTK. Syntax is described using an extended BNF notation in which alternatives are
separated by a vertical bar |, parentheses () denote factoring, brackets [ ] denote options, and
braces {} denote zero or more repetitions.



* The fact that this is possible does not mean that it is recommended practice!

    ~o <DISCRETE> <StreamInfo> 2 1 1
    ~h "dhmm1"
    <BeginHMM>
      <NumStates> 4
      <State> 2
        <NumMixes> 10 2
        <SWeights> 2 0.9 1.1
        <Stream> 1
          <DProb> 3288*4 32767*6
        <Stream> 2
          <DProb> 1644*2
      <State> 3
        <NumMixes> 10 2
        <SWeights> 2 0.9 1.1
        <Stream> 1
          <DProb> 5461*10
        <Stream> 2
          <DProb> 1644*2
      <TransP> 4
        0.0 1.0 0.0 0.0
        0.0 0.5 0.5 0.0
        0.0 0.0 0.6 0.4
        0.0 0.0 0.0 0.0
    <EndHMM>

Fig. 7.18 Discrete Probability HMM

All keywords are enclosed in angle brackets* and the case of the keyword name is not significant.
White space is not significant except within double-quoted strings.

The top level structure of a HMM definition is shown by the following rule.



    hmmdef = [ ~h macro ]
             <BeginHMM>
               [ globalOpts ]
               <NumStates> short
               state { state }
               transP
               [ duration ]
             <EndHMM>



A HMM definition consists of an optional set of global options followed by the <NumStates>
keyword whose following argument specifies the number of states in the model, inclusive of the
non-emitting entry and exit states**. The information for each state is then given in turn, followed by
the parameters of the transition matrix and the model duration parameters, if any. The name of
the HMM is given by the ~h macro. If the HMM is the only definition within a file, the ~h macro
name can be omitted and the HMM name is assumed to be the same as the file name.

The global options are common to all HMMs. They can be given separately using a ~o option
macro

    optmacro = ~o globalOpts



or they can be included in one or more HMM definitions. Global options may be repeated but no
definition can change a previous definition. All global options must be defined before any other
macro definition is processed. In practice this means that any HMM system which uses parameter
tying must have a ~o option macro at the head of the first macro file processed.



* This definition covers the textual version only. The syntax for the binary format is identical apart from the way
that the lexical items are encoded.

** Integer numbers are specified as either char or short. This has no effect on text-based definitions but for binary
format it indicates the underlying C type used to represent the number.






The full set of global options is given below. Every HMM set must define the vector size (via
<VecSize>), the stream widths (via <StreamInfo>) and the observation parameter kind. However,
if only the stream widths are given, then the vector size will be inferred. If only the vector size is
given, then a single stream of identical width will be assumed. All other options default to null.

    globalOpts = option { option }
    option     = <HmmSetId> string |
                 <StreamInfo> short { short } |
                 <VecSize> short |
                 <ProjSize> short |
                 <InputXform> inputXform |
                 <ParentXForm> ~a macro |
                 covkind |
                 durkind |
                 parmkind

The <HmmSetId> option allows the user to give the MMF an identifier. This is used as a sanity
check to make sure that a TMF can be safely applied to this MMF. The arguments to the
<StreamInfo> option are the number of streams (default 1) and then for each stream, the width
of that stream. The <VecSize> option gives the total number of elements in each input vector.
<ProjSize> is the number of "nuisance" dimensions removed using, for example, an HLDA transform.
The <ParentXForm> allows the semi-tied macro, if any, associated with the model-set to be
specified. If both <VecSize> and <StreamInfo> are included then the sum of all the stream widths
must equal the input vector size.

The covkind defines the kind of the covariance matrix

    covkind = <DiagC> | <InvDiagC> | <FullC> |
              <LLTC> | <XformC>

where <InvDiagC> is used internally. <LLTC> and <XformC> are not used in HTK Version 3.4.
Setting the covariance kind as a global option forces all components to have this kind. In particular,
it prevents mixing full and diagonal covariances within a HMM set.

The durkind denotes the type of duration model used according to the following rules

    durkind = <nullD> | <poissonD> | <gammaD> | <genD>

For anything other than <nullD>, a duration vector must be supplied for the model or each state
as described below. Note that no current HTK tool can estimate or use such duration vectors.

The parameter kind is any legal parameter kind including qualified forms (see section 5.1)

    parmkind = <basekind{_D|_A|_T|_E|_N|_Z|_0|_V|_C|_K}>
    basekind = <discrete> | <lpc> | <lpcepstra> | <mfcc> | <fbank> |
               <melspec> | <lprefc> | <lpdelcep> | <user>

where the syntax rule for parmkind is non-standard in that no spaces are allowed between the
base kind and any subsequent qualifiers. As noted in chapter 5, <lpdelcep> is provided only for
compatibility with earlier versions of HTK and its further use should be avoided.
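The parmkind rule can be checked mechanically. The following Python sketch splits a parameter kind into its base kind and qualifiers and rejects anything outside the rule given above; the function name is hypothetical, and treating the name as case-insensitive follows the statement that the case of keyword names is not significant.

```python
BASE_KINDS = {"DISCRETE", "LPC", "LPCEPSTRA", "MFCC", "FBANK",
              "MELSPEC", "LPREFC", "LPDELCEP", "USER"}
QUALIFIERS = {"D", "A", "T", "E", "N", "Z", "0", "V", "C", "K"}


def parse_parmkind(text):
    """Split e.g. '<MFCC_E_D_A>' into ('MFCC', ['E', 'D', 'A'])."""
    name = text.strip()
    if name.startswith("<") and name.endswith(">"):
        name = name[1:-1]
    base, *qualifiers = name.upper().split("_")
    if base not in BASE_KINDS:
        raise ValueError(f"unknown base kind: {base!r}")
    for qualifier in qualifiers:
        if qualifier not in QUALIFIERS:
            raise ValueError(f"unknown qualifier: {qualifier!r}")
    return base, qualifiers


print(parse_parmkind("<MFCC_E_D_A>"))
```

Because no spaces are allowed between the base kind and its qualifiers, an input such as "MFCC _D" is rejected: the space becomes part of the base kind name, which then fails the look-up.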

Each state of each HMM must have its own section defining the parameters associated with that
state

    state = <State> short stateinfo

where the short following <State> is the state number. State information can be defined in
any order. The syntax is as follows

    stateinfo = ~s macro |
                [ mixes ] [ weights ] stream { stream } [ duration ]
    macro     = string

A stateinfo definition consists of an optional specification of the number of mixture components,
an optional set of stream weights, followed by a block of information for each stream, optionally
terminated with a duration vector. Alternatively, ~s macro can be written where macro is the name
of a previously defined macro.

The optional mixes in a stateinfo definition specify the number of mixture components (or discrete
codebook size) for each stream of that state

    mixes = <NumMixes> short {short}

where there should be one short for each stream. If this specification is omitted, it is assumed that
all streams have just one mixture component.

The optional weights in a stateinfo definition define a set of exponent weights for each independent
data stream. The syntax is

    weights = ~w macro | <SWeights> short vector
    vector  = float { float }

where the short gives the number S of weights (which should match the value given in the <StreamInfo>
option) and the vector contains the S stream weights $\gamma_s$ (see section 7.1).

The definition of each stream depends on the kind of HMM set. In the normal case, it consists
of a sequence of mixture component definitions optionally preceded by the stream number. If the
stream number is omitted then it is assumed to be 1. For tied-mixture and discrete HMM sets,
special forms are used.

    stream = [ <Stream> short ]
             (mixture { mixture } | tmixpdf | discpdf)

The definition of each mixture component consists of a Gaussian pdf optionally preceded by the
mixture number and its weight

    mixture = [ <Mixture> short float ] mixpdf

If the <Mixture> part is missing then mixture 1 is assumed and the weight defaults to 1.0.

The tmixpdf option is used only for fully tied mixture sets. Since the mixpdf parts are all macros
in a tied mixture system and since they are identical for every stream and state, it is only necessary
to know the mixture weights. The tmixpdf syntax allows these to be specified in the following
compact form

    tmixpdf    = <TMix> macro weightList
    weightList = repShort { repShort }
    repShort   = short [ * char ]

where each short is a mixture component weight scaled so that a weight of 1.0 is represented by
the integer 32767. The optional asterisk followed by a char is used to indicate a repeat count. For
example, 0*5 is equivalent to 5 zeroes. The Gaussians which make up the pool of tied-mixtures are
defined using ~m macros called macro1, macro2, macro3, etc.

Discrete probability HMMs are defined in a similar way

    discpdf = <DProb> weightList

The only difference is that the weights in the weightList are scaled log probabilities as defined in
section 7.6.

The definition of a Gaussian pdf requires the mean vector to be given and one of the possible
forms of covariance

    mixpdf  = ~m macro | mean cov [ <GConst> float ]
    mean    = ~u macro | <Mean> short vector
    cov     = var | inv | xform
    var     = ~v macro | <Variance> short vector
    inv     = ~i macro |
              (<InvCovar> | <LLTCovar>) short tmatrix
    xform   = ~x macro | <Xform> short short matrix
    matrix  = float {float}
    tmatrix = matrix



In mean and var, the short preceding the vector defines the length of the vector, in inv the short
preceding the tmatrix gives the size of this square upper triangular matrix, and in xform the two
shorts preceding the matrix give the number of rows and columns. The optional <GConst>* gives
that part of the log probability of a Gaussian that can be precomputed. If it is omitted, then it will
be computed during load-in; including it simply saves some time. HTK tools which output HMM
definitions always include this field.

* Specifically, in equation 7.2 the GCONST value seen in HMM sets is calculated by multiplying the
determinant of the covariance matrix by $(2\pi)^n$.
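For a diagonal covariance the determinant is the product of the variances, so the precomputed constant is easy to illustrate. The Python sketch below assumes that the stored value is the natural logarithm of the product described in the footnote, which is an inference from the statement that <GConst> is part of the log probability; the function name is hypothetical.

```python
import math


def gconst_diag(variances):
    """log((2*pi)^n * |Sigma|) for a diagonal covariance with the given variances."""
    n = len(variances)
    log_det = sum(math.log(v) for v in variances)   # log of the product
    return n * math.log(2.0 * math.pi) + log_det


# Two-dimensional Gaussian with unit variances.
print(gconst_diag([1.0, 1.0]))
```

Working with the sum of log variances rather than the product itself avoids numerical underflow when the vector size is large.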

In addition to defining the output distributions, a state can have a duration probability
distribution defined for it. However, no current HTK tool can estimate or use these.

    duration = ~d macro | <Duration> short vector

Alternatively, as shown by the top level syntax for a hmmdef, duration parameters can be specified
for a whole model.

The transition matrix is defined by

    transP = ~t macro | <TransP> short matrix

where the short in this case should be equal to the number of states in the model.

To support HMM adaptation (as described in chapter 9) baseclasses and regression class trees
are defined. A baseclass is defined as

    baseClass = ~b macro baseopts classes
    baseopts  = <MMFIdMask> string <Parameters> baseKind <NumClasses> int
    baseKind  = MIXBASE | MEANBASE | COVBASE
    classes   = <Class> int itemlist { classes }

where itemlist is a list of mixture components specified using the same conventions as the HHEd
command described in section 10.3. A regression class tree may also exist for an HMM set. This is
defined by

    regTree     = ~r macro <BaseClass> baseclasses node
    baseclasses = ~b macro | baseopts classes
    node        = (<Node> int int int { int } | <TNode> int int int { int }) { node }

For the definition of a node (<Node>) in node, the first integer is the node number, the second
the number of children, followed by the children node numbers*. The integers in the definition of
a terminal node (<TNode>) define the node number, the number of base classes associated with the
terminal and the base class-index numbers.
Adaptation transforms are defined using

    adaptXForm  = ~a macro adaptOpts <XformSet> xformset
    adaptOpts   = <AdaptKind> adaptKind <BaseClass> baseclasses [<ParentXForm> parentxform]
    parentxform = ~a macro | adaptOpts <XformSet> xformset
    adaptKind   = TREE | BASE
    xformset    = <XFormKind> xformKind <NumXForms> int { linxform }
    xformKind   = MLLRMEAN | MLLRCOV | MLLRVAR | CMLLR | SEMIT
    linxform    = <LinXForm> int <VecSize> int [<OFFSET> xformbias]
                  <BlockInfo> int int {int} block {block}
    xformbias   = ~y macro | <Bias> short vector
    block       = <Block> int xform

In the definition of the <BlockInfo> the first integer is the number of blocks, followed by the size of
each of the blocks. For examples of the adaptation transform format see chapter 9.
Finally the input transform is defined by

    inputXform = ~j macro | inhead inmatrix
    inhead     = <MMFIdMask> string parmkind [<PreQual>]
    inmatrix   = <LinXform> <VecSize> int <BlockInfo> int int {int} block {block}
    block      = <Block> int xform

where the integer following <VecSize> is the number of dimensions after applying the linear transform
and must match the vector size of the HMM definition. The first integer after <BlockInfo> is the
number of blocks; this is followed by the number of output dimensions from each of the blocks.



* Though the notation supports n-ary trees, the regression class tree code can only generate binary regression class
trees.

In chapter 7 the various types of HMM were described and the way in which they are represented
within HTK was explained. Defining the structure and overall form of a set of HMMs is the first step
towards building a recogniser. The second step is to estimate the parameters of the HMMs from
examples of the data sequences that they are intended to model. This process of parameter
estimation is usually called training. HTK supplies four basic tools for parameter estimation: HCompV,
HInit, HRest and HERest. HCompV and HInit are used for initialisation. HCompV will set
the mean and variance of every Gaussian component in a HMM definition to be equal to the global
mean and variance of the speech training data. This is typically used as an initialisation stage for
flat-start training. Alternatively, a more detailed initialisation is possible using HInit which will
compute the parameters of a new HMM using a Viterbi style of estimation.

HRest and HERest are used to refine the parameters of existing HMMs using Baum-Welch
Re-estimation. Like HInit, HRest performs isolated-unit training whereas HERest operates on
complete model sets and performs embedded-unit training. In general, whole word HMMs are built
using HInit and HRest, and continuous speech sub-word based systems are built using HERest
initialised by either HCompV or HInit and HRest.

This chapter describes these training tools and their use for estimating the parameters of plain
(i.e. untied) continuous density HMMs. The use of tying and special cases such as tied-mixture
HMM sets and discrete probability HMMs are dealt with in later chapters. The first section of
this chapter gives an overview of the various training strategies possible with HTK. This is then
followed by sections covering initialisation, isolated-unit training, and embedded training. The
chapter concludes with a section detailing the various formulae used by the training tools.



8.1 Training Strategies 

As indicated in the introduction above, the basic operation of the HTK training tools involves
reading in a set of one or more HMM definitions, and then using speech data to estimate the
parameters of these definitions. The speech data files are normally stored in parameterised form
such as LPC or MFCC parameters. However, additional parameters such as delta coefficients are
normally computed on-the-fly whilst loading each file.



Fig. 8.1 Isolated Word Training



In fact, it is also possible to use waveform data directly by performing the full parameter
conversion on-the-fly. Which approach is preferred depends on the available computing resources. The
advantages of storing the data already encoded are that the data is more compact in parameterised
form and pre-encoding avoids wasting compute time converting the data each time that it is read
in. However, if the training data is derived from CD-ROMs and they can be accessed automatically
on-line, then the extra compute may be worth the saving in magnetic disk storage.

The methods for configuring speech data input to HTK tools were described in detail in
chapter 5. All of the various input mechanisms are supported by the training tools except direct
audio input.

The precise way in which the training tools are used depends on the type of HMM system
to be built and the form of the available training data. Furthermore, HTK tools are designed
to interface cleanly to each other, so a large number of configurations are possible. In practice,
however, HMM-based speech recognisers are either whole-word or sub-word.

As the name suggests, whole word modelling refers to a technique whereby each individual word
in the system vocabulary is modelled by a single HMM. As shown in Fig. 8.1, whole word HMMs are
most commonly trained on examples of each word spoken in isolation. If these training examples,
which are often called tokens, have had leading and trailing silence removed, then they can be input
directly into the training tools without the need for any label information. The most common
method of building whole word HMMs is to firstly use HInit to calculate initial parameters for the
model and then use HRest to refine the parameters using Baum-Welch re-estimation. Where there
is limited training data and recognition in adverse noise environments is needed, so-called fixed 
variance models can offer improved robustness. These are models in which all the variances are set 
equal to the global speech variance and never subsequently re-estimated. The tool HCompV can 
be used to compute this global variance. 
Although HTK gives full support for building whole-word HMM systems, the bulk of its facilities
are focussed on building sub-word systems in which the basic units are the individual sounds of the
language called phones. One HMM is constructed for each such phone and continuous speech is
recognised by joining the phones together to make any required vocabulary using a pronunciation
dictionary. 

The basic procedures involved in training a set of subword models are shown in Fig. 8.2. The
core process involves the embedded training tool HERest. HERest uses continuously spoken
utterances as its source of training data and simultaneously re-estimates the complete set of subword
HMMs. For each input utterance, HERest needs a transcription, i.e. a list of the phones in that
utterance. HERest then joins together all of the subword HMMs corresponding to this phone list
to make a single composite HMM. This composite HMM is used to collect the necessary statistics
for the re-estimation. When all of the training utterances have been processed, the total set of
accumulated statistics is used to re-estimate the parameters of all of the phone HMMs. It is
important to emphasise that in the above process, the transcriptions are only needed to identify
the sequence of phones in each utterance. No phone boundary information is needed.

The initialisation of a set of phone HMMs prior to embedded re-estimation using HERest can
be achieved in two different ways. As shown on the left of Fig. 8.2, a small set of hand-labelled
bootstrap training data can be used along with the isolated training tools HInit and HRest to 
initialise each phone HMM individually. When used in this way, both HInit and HRest use the 
label information to extract all the segments of speech corresponding to the current phone HMM 
in order to perform isolated word training. 

A simpler initialisation procedure uses HCompV to assign the global speech mean and variance 
to every Gaussian distribution in every phone HMM. This so-called flat start procedure implies 
that during the first cycle of embedded re-estimation, each training utterance will be uniformly 
segmented. The hope then is that enough of the phone models align with actual realisations of that 
phone so that on the second and subsequent iterations, the models align as intended. 
One of the major problems to be faced in building any HMM-based system is that the amount of
training data for each model will be variable and is rarely sufficient. To overcome this, HTK allows 
a variety of sharing mechanisms to be implemented whereby HMM parameters are tied together so 
that the training data is pooled and more robust estimates result. These tyings, along with a variety 
of other manipulations, are performed using the HTK HMM editor HHEd. The use of HHEd is 
described in a later chapter. Here it is sufficient to note that a phone-based HMM set typically 
goes through several refinement cycles of editing using HHEd followed by parameter re-estimation 
using HERest before the final model set is obtained. 

Having described in outline the main training strategies, each of the above procedures will be 
described in more detail.

8.2 Initialisation using HInit

In order to create a HMM definition, it is first necessary to produce a prototype definition. As
explained in Chapter 7, HMM definitions can be stored as a text file and hence the simplest way of
creating a prototype is by using a text editor to manually produce a definition of the form shown
in Fig 7.2, Fig 7.3 etc. The function of a prototype definition is to describe the form and topology
of the HMM; the actual numbers used in the definition are not important. Hence, the vector size
and parameter kind should be specified and the number of states chosen. The allowable transitions
between states should be indicated by putting non-zero values in the corresponding elements of the
transition matrix and zeros elsewhere. The rows of the transition matrix must sum to one except
for the final row which should be all zero. Each state definition should show the required number
of streams and mixture components in each stream. All mean values can be zero but diagonal
variances should be positive and covariance matrices should have positive diagonal elements. All
state definitions can be identical.
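The row-sum rule for a prototype transition matrix can be illustrated with a small Python sketch. It is not part of HTK: the function names and the default self-loop value of 0.6 are assumptions made for illustration, and the matrix is taken to include the non-emitting entry and exit states.

```python
def left_right_transitions(num_states, stay=0.6):
    """Build a left-right transition matrix for a prototype HMM.

    num_states counts every state, including the non-emitting entry
    and exit states (an assumption of this sketch).
    """
    n = num_states
    a = [[0.0] * n for _ in range(n)]
    a[0][1] = 1.0                      # entry state moves to the first emitting state
    for i in range(1, n - 1):
        a[i][i] = stay                 # self-loop
        a[i][i + 1] = 1.0 - stay       # move on to the next state
    return a                           # the final row is left all zero


def is_valid_prototype(a, tol=1e-6):
    """Check that every row sums to one, except the last, which must be all zero."""
    *body, last = a
    rows_ok = all(abs(sum(row) - 1.0) <= tol for row in body)
    last_ok = all(x == 0.0 for x in last)
    return rows_ok and last_ok
```

A matrix built this way for five states (three emitting) passes the check, while changing a single self-loop value without adjusting the rest of its row makes the check fail.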



Having set up an appropriate prototype, a HMM can be initialised using the HTK tool HInit.
The basic principle of HInit depends on the concept of a HMM as a generator of speech vectors.
Every training example can be viewed as the output of the HMM whose parameters are to be
estimated. Thus, if the state that generated each vector in the training data was known, then the
unknown means and variances could be estimated by averaging all the vectors associated with each
state. Similarly, the transition matrix could be estimated by simply counting the number of time
slots that each state was occupied. This process is described more formally in section 8.8 below.
The above idea can be implemented by an iterative scheme as shown in Fig 8.3. Firstly, the
Viterbi algorithm is used to find the most likely state sequence corresponding to each training
example, then the HMM parameters are estimated. As a side-effect of finding the Viterbi state
alignment, the log likelihood of the training data can be computed. Hence, the whole estimation
process can be repeated until no further increase in likelihood is obtained.
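The estimation step of this scheme, taken on its own, can be sketched in Python as follows. This is an illustration under stated assumptions rather than HInit's actual code: the state alignment is assumed to be given already, each state has a single diagonal-variance Gaussian, states are numbered from 0, and the transition estimates are obtained by counting moves along the alignment, which is closely related to the occupation counts described above.

```python
def estimate_from_alignment(examples, num_states):
    """Estimate per-state Gaussians and transitions from known alignments.

    examples: list of (vectors, states) pairs, where vectors is a list of
    equal-length feature vectors and states gives the emitting-state index
    (0-based) taken to have generated each vector.
    Returns (means, variances, transitions); column num_states of the
    transition table stands for leaving the model.
    """
    dim = len(examples[0][0][0])
    count = [0] * num_states
    total = [[0.0] * dim for _ in range(num_states)]
    total_sq = [[0.0] * dim for _ in range(num_states)]
    trans = [[0] * (num_states + 1) for _ in range(num_states)]

    for vectors, states in examples:
        for x, j in zip(vectors, states):
            count[j] += 1
            for d in range(dim):
                total[j][d] += x[d]
                total_sq[j][d] += x[d] * x[d]
        # Count state-to-state moves, then the final move out of the model.
        for i, j in zip(states, states[1:]):
            trans[i][j] += 1
        trans[states[-1]][num_states] += 1

    means, variances = [], []
    for j in range(num_states):
        n = max(count[j], 1)           # guard against states with no data
        mu = [s / n for s in total[j]]
        means.append(mu)
        variances.append([sq / n - m * m for sq, m in zip(total_sq[j], mu)])

    transitions = []
    for row in trans:
        row_total = sum(row)
        transitions.append([c / row_total if row_total else 0.0 for c in row])
    return means, variances, transitions
```

In a full Viterbi training loop, this step would alternate with a realignment of the data against the updated parameters.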

This process requires some initial HMM parameters to get started. To circumvent this problem,
HInit starts by uniformly segmenting the data and associating each successive segment with
successive states. Of course, this only makes sense if the HMM is left-right. If the HMM is ergodic, then
the uniform segmentation can be disabled and some other approach taken. For example, HCompV
can be used as described below.
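Uniform segmentation can be sketched in a few lines of Python. The function below is a hypothetical helper, not HInit's implementation; it assigns each frame of one training example to an emitting state (numbered from 0) so that successive, roughly equal segments map to successive states.

```python
def uniform_segmentation(num_frames, num_states):
    """Return the emitting-state index assigned to each frame, in order."""
    if num_frames < num_states:
        raise ValueError("fewer frames than emitting states")
    # Integer arithmetic spreads the frames as evenly as possible.
    return [t * num_states // num_frames for t in range(num_frames)]
```

For six frames and three states this gives two frames per state; when the division is uneven, the segment lengths differ by at most one frame.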

If any HMM state has multiple mixture components, then the training vectors are associated
with the mixture component with the highest likelihood. The number of vectors associated with
each component within a state can then be used to estimate the mixture weights. In the uniform
segmentation stage, a K-means clustering algorithm is used to cluster the vectors within each state.

Turning now to the practical use of HInit, whole word models can be initialised by typing a
command of the form

HInit hmm data1 data2 data3

where hmm is the name of the file holding the prototype HMM and data1, data2, etc. are the names
of the speech files holding the training examples, each file holding a single example with no leading
or trailing silence. The HMM definition can be distributed across a number of macro files loaded
using the standard -H option. For example, in

HInit -H mac1 -H mac2 hmm data1 data2 data3 ...
then the macro files mac1 and mac2 would be loaded first. If these contained a definition for hmm,
then no further HMM definition input would be attempted. If however, they did not contain a
definition for hmm, then HInit would attempt to open a file called hmm and would expect to find a
definition for hmm within it. HInit can in principle load a large set of HMM definitions, but it will
only update the parameters of the single named HMM. On completion, HInit will write out new
versions of all HMM definitions loaded on start-up. The default behaviour is to write these to the
current directory which has the usually undesirable effect of overwriting the prototype definition.
This can be prevented by specifying a new directory for the output definitions using the -M option.
Thus, typical usage of HInit takes the form

HInit -H globals -M dir1 proto data1 data2 data3 ...
mv dir1/proto dir1/wordX



Here globals is assumed to hold a global options macro (and possibly others). The actual HMM
definition is loaded from the file proto in the current directory and the newly initialised definition
along with a copy of globals will be written to dir1. Since the newly created HMM will still be
called proto, it is renamed as appropriate.

For most real tasks, the number of data files required will exceed the command line argument
limit and a script file is used instead. Hence, if the names of the data files are stored in the file
trainlist then typing

HInit -S trainlist -H globals -M dir1 proto

would have the same effect as previously.
When building sub-word models, HInit can be used in the same manner as above to initialise
each individual sub-word HMM. However, in this case, the training data is typically continuous
speech with associated label files identifying the speech segments corresponding to each sub-word.
To illustrate this, the following command could be used to initialise a sub-word HMM for the phone
ih

HInit -S trainlist -H globals -M dir1 -l ih -L labs proto
mv dir1/proto dir1/ih

where the option -l defines the name of the sub-word model, and the file trainlist is assumed to
hold
data/tr1.mfc
data/tr2.mfc
data/tr3.mfc
data/tr4.mfc
data/tr5.mfc
data/tr6.mfc

In this case, HInit will first try to find label files corresponding to each data file. In the example
here, the standard -L option indicates that they are stored in a directory called labs. As an
alternative, they could be stored in a Master Label File (MLF) and loaded via the standard option
-I. Once the label files have been loaded, each data file is scanned and all segments corresponding
to the label ih are loaded. Figure 8.4 illustrates this process.
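The segment selection step can be sketched in Python as follows. This is an illustration only, not the HTK label reader: it assumes label lines of the form "start end name" with times in 100 ns units, and a frame period of 100000 units (10 ms). The function name and the default frame period are assumptions of the sketch.

```python
def segments_for_label(label_lines, target, frame_period=100000):
    """Return (first_frame, end_frame) pairs for every segment named target.

    Times are taken to be integers in 100 ns units, so the default
    frame_period of 100000 corresponds to a 10 ms frame shift (assumed).
    end_frame is exclusive.
    """
    segments = []
    for line in label_lines:
        fields = line.split()
        if len(fields) < 3:
            continue                   # skip lines without both boundaries
        start, end, name = int(fields[0]), int(fields[1]), fields[2]
        if name == target:
            segments.append((start // frame_period, end // frame_period))
    return segments
```

Applied to a transcription containing two ih segments, the function returns two frame ranges, each of which would be treated as a separate training example for the ih model.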

All HTK tools support the -T trace option and although the details of tracing vary from tool
to tool, setting the least significant bit (e.g. by -T 1) causes all tools to output top level progress
information. In the case of HInit, this information includes the log likelihood at each iteration and
hence it is very useful for monitoring convergence. For example, enabling top level tracing in the
previous example might result in the following being output

HMM proto ...
2 3 4 (width)
The first part summarises the structure of the HMM, in this case, the data is single stream MFCC
coefficients with energy and deltas appended. The HMM has 3 emitting states, each single Gaussian
and the stream width is 26. The current option settings are then given followed by the convergence
information. In this example, convergence was reached after 6 iterations. However if the maxIter
limit was reached, then the process would terminate regardless.

HInit provides a variety of command line options for controlling its detailed behaviour. The
types of parameter estimated by HInit can be controlled using the -u option. For example, -u
mtw would update the means, transition matrices and mixture component weights but would leave
the variances untouched. A variance floor can be applied using the -v option to prevent any variance
getting too small. This option applies the same variance floor to all speech vector components. More
precise control can be obtained by specifying a variance macro (i.e. a ~v macro) called varFloor1
for stream 1, varFloor2 for stream 2, etc. Each element of these variance vectors then defines a
floor for the corresponding HMM variance components.
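The effect of a variance floor can be shown with a short Python sketch. The function is a hypothetical helper, not HTK code; it covers both cases described above, a single floor applied to every component and a floor vector with one value per component.

```python
def apply_variance_floor(variances, floor):
    """Raise each variance element to at least its floor.

    floor may be a single number (one floor for every element) or a list
    holding one floor per element (a floor vector).
    """
    if isinstance(floor, (int, float)):
        floor = [floor] * len(variances)
    if len(floor) != len(variances):
        raise ValueError("floor vector and variance vector differ in length")
    return [max(v, f) for v, f in zip(variances, floor)]
```

With a single floor of 0.01, a variance of 0.001 is raised to 0.01 while larger variances are left unchanged.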

The full list of options supported by HInit is described in the Reference Section. 



8.3 Flat Starting with HCompV 

One limitation of using HInit for the initialisation of sub-word models is that it requires labelled 
training data. For cases where this is not readily available, an alternative initialisation strategy is 
to make all models equal initially and move straight to embedded training using HERest. The 
idea behind this so-called flat start training is similar to the uniform segmentation strategy adopted
by HInit since by making all states of all models equal, the first iteration of embedded training 
will effectively rely on a uniform segmentation of the data. 



Fig. 8.5 Flat Start Initialisation (a prototype HMM definition and a sample of training speech are
input to HCompV, which outputs identical phone models)

Flat start initialisation is provided by the HTK tool HCompV whose operation is illustrated
by Fig 8.5. The input/output of HMM definition files and training files in HCompV works in
exactly the same way as described above for HInit. It reads in a prototype HMM definition and
some training data and outputs a new definition in which every mean and covariance is equal to
the global speech mean and covariance. Thus, for example, the following command would read a
prototype definition called proto, read in all speech vectors from data1, data2, data3, etc, compute
the global mean and covariance and write out a new version of proto in dir1 with this mean and
covariance.

HCompV -m -H globals -M dir1 proto data1 data2 data3 ...



The default operation of HCompV is only to update the covariances of the HMM and leave
the means unchanged. The use of the -m option above causes the means to be updated too. This
apparently curious default behaviour arises because HCompV is also used to initialise the variances
in so-called Fixed-Variance HMMs. These are HMMs initialised in the normal way except that all
covariances are set equal to the global speech covariance and never subsequently changed.
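The computation behind a flat start can be sketched in Python. This is a minimal illustration assuming diagonal covariances and in-memory feature vectors; the function names are hypothetical and the sketch does not read or write HTK files.

```python
def global_mean_and_variance(utterances):
    """Pool all frames of all utterances and return (mean, variance) per dimension.

    utterances: list of utterances, each a list of equal-length feature vectors.
    Only diagonal variances are computed (an assumption of this sketch).
    """
    dim = len(utterances[0][0])
    n = 0
    total = [0.0] * dim
    total_sq = [0.0] * dim
    for frames in utterances:
        for x in frames:
            n += 1
            for d in range(dim):
                total[d] += x[d]
                total_sq[d] += x[d] * x[d]
    mean = [s / n for s in total]
    var = [sq / n - m * m for sq, m in zip(total_sq, mean)]
    return mean, var


def flat_start_states(mean, var, num_emitting_states):
    """Give every emitting state its own copy of the global mean and variance."""
    return [(list(mean), list(var)) for _ in range(num_emitting_states)]
```

Leaving the means of an existing model untouched and copying only the variance corresponds to the default behaviour described above, while copying both corresponds to the use of -m.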

Finally, it should be noted that HCompV can also be used to generate variance floor macros
by using the -f option.
Fig. 8.6 HRest Operation



HRest is the final tool in the set designed to manipulate isolated unit HMMs. Its operation is
very similar to HInit except that, as shown in Fig 8.6, it expects the input HMM definition to have
been initialised and it uses Baum-Welch re-estimation in place of Viterbi training. This involves
finding the probability of being in each state at each time frame using the Forward-Backward
algorithm. This probability is then used to form weighted averages for the HMM parameters.
Thus, whereas Viterbi training makes a hard decision as to which state each training vector was
"generated" by, Baum-Welch takes a soft decision. This can be helpful when estimating phone-based
HMMs since there are no hard boundaries between phones in real speech and using a soft decision
may give better results. The mathematical details of the Baum-Welch re-estimation process are
given below in section 8.8.
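The soft decision can be illustrated with a small forward-backward sketch in Python. It makes several simplifying assumptions that differ from HTK: the model has an explicit initial distribution instead of non-emitting entry and exit states, the output likelihoods of each frame are supplied directly, and no scaling or log arithmetic is used, so it is suitable only for short sequences.

```python
def state_occupation(pi, a, b):
    """Return gamma[t][j], the probability of being in state j at frame t.

    pi[j]   : initial state probabilities (an assumption of this sketch)
    a[i][j] : transition probabilities
    b[t][j] : output likelihood of frame t in state j
    """
    T, N = len(b), len(pi)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[0.0] * N for _ in range(T)]

    for j in range(N):
        alpha[0][j] = pi[j] * b[0][j]
        beta[T - 1][j] = 1.0

    # Forward pass.
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = sum(alpha[t - 1][i] * a[i][j] for i in range(N)) * b[t][j]

    # Backward pass.
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(a[i][j] * b[t + 1][j] * beta[t + 1][j] for j in range(N))

    total = sum(alpha[T - 1])          # likelihood of the whole sequence
    return [[alpha[t][j] * beta[t][j] / total for j in range(N)] for t in range(T)]
```

Each row of the result sums to one. A weighted mean for state j is then the sum of gamma[t][j] times the frame at t, divided by the sum of gamma[t][j], whereas Viterbi training would use weights of exactly 0 or 1.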

HRest is usually applied directly to the models generated by HInit. Hence for example, the
generation of a sub-word model for the phone ih begun in section 8.2 would be continued by
executing the following command

HRest -S trainlist -H dir1/globals -M dir2 -l ih -L labs dir1/ih



This will load the HMM definition for ih from dir1, re-estimate the parameters using the speech
segments labelled with ih and write the new definition to directory dir2. 

If HRest is used to build models with a large number of mixture components per state, a
strategy must be chosen for dealing with defunct mixture components. These are mixture components
which have very little associated training data and as a consequence either the variances or the
corresponding mixture weight becomes very small. If either of these events happens, the mixture
component is effectively deleted and provided that at least one component in that state is left, a 
warning is issued. If this behaviour is not desired then the variance can be floored as described 
previously using the -v option (or a variance floor macro) and/or the mixture weight can be floored 
using the -w option. 

Finally, a problem which can arise when using HRest to initialise sub-word models is that of 
over-short training segments. By default, HRest ignores all training examples which have fewer 
frames than the model has emitting states. For example, suppose that a particular phone with 3 
emitting states had only a few training examples with more than 2 frames of data. In this case, 
there would be two solutions. Firstly, the number of emitting states could be reduced. Since 
HTK does not require all models to have the same number of states, this is perfectly feasible.
Alternatively, some skip transitions could be added and the default reject mechanism disabled by 
setting the -t option. Note here that HInit has the same reject mechanism and suffers from the 
same problems. HInit, however, does not allow the reject mechanism to be suppressed since the 
uniform segmentation process would otherwise fail. 



8.5 Embedded Training using HERest 

Whereas isolated unit training is sufficient for building whole word models and initialisation of
models using hand-labelled bootstrap data, the main HMM training procedures for building
sub-word systems revolve around the concept of embedded training. Unlike the processes described so
far, embedded training simultaneously updates all of the HMMs in a system using all of the training
data. It is performed by HERest which, unlike HRest, performs just a single iteration.

In outline, HERest works as follows. On startup, HERest loads in a complete set of HMM
definitions. Every training file must have an associated label file which gives a transcription for
that file. Only the sequence of labels is used by HERest, however, and any boundary location
information is ignored. Thus, these transcriptions can be generated automatically from the known
orthography of what was said and a pronunciation dictionary.

HERest processes each training file in turn. After loading it into memory, it uses the associated
transcription to construct a composite HMM which spans the whole utterance. This composite
HMM is made by concatenating instances of the phone HMMs corresponding to each label in the
transcription. The Forward-Backward algorithm is then applied and the sums needed to form
the weighted averages accumulated in the normal way. When all of the training files have been
processed, the new parameter estimates are formed from the weighted sums and the updated HMM
set is output.
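The construction of a composite model from a transcription can be sketched in Python at the level of state bookkeeping. This is an illustration only: the function name is hypothetical, each model is represented simply by its number of emitting states, and the joining of transition matrices through non-emitting states is not shown.

```python
def composite_states(transcription, emitting_states):
    """List the emitting states of the composite HMM for one utterance.

    transcription   : sequence of labels, e.g. ["sil", "ih", "t", "sil"]
    emitting_states : dict mapping each model name to its number of
                      emitting states (a stand-in for the loaded HMM set)
    Returns a list of (label, state_index) pairs in left-to-right order.
    """
    missing = sorted(set(transcription) - set(emitting_states))
    if missing:
        # Every distinct transcription label needs a corresponding model.
        raise KeyError("no model for label(s): " + ", ".join(missing))
    return [(label, s)
            for label in transcription
            for s in range(emitting_states[label])]
```

Since a label that occurs twice contributes two separate instances to the composite model, the statistics gathered for both instances would be added into the accumulators of the same underlying phone HMM.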

The mathematical details of embedded Baum-Welch re-estimation are given below in section 8.8. 

In order to use HERest, it is first necessary to construct a file containing a list of all HMMs in
the model set with each model name being written on a separate line. The names of the models in
this list must correspond to the labels used in the transcriptions and there must be a corresponding
model for every distinct transcription label. HERest is typically invoked by a command line of
the form

HERest -S trainlist -I labs -H dir1/hmacs -M dir2 hmmlist

where hmmlist contains the list of HMMs. On startup, HERest will load the HMM master macro
file (MMF) hmacs (there may be several of these). It then searches for a definition for each HMM
listed in the hmmlist. If any HMM name is not found, it attempts to open a file of the same name
in the current directory (or a directory designated by the -d option). Usually in large subword
systems, however, all of the HMM definitions will be stored in MMFs. Similarly, all of the required
transcriptions will be stored in one or more Master Label Files (MLFs), and in the example, they
are stored in the single MLF called labs.
Fig. 8.7 File Processing in HERest

Once all MMFs and MLFs have been loaded, HERest processes each file in the trainlist,
and accumulates the required statistics as described above. On completion, an updated MMF is
output to the directory dir2. If a second iteration is required, then HERest is reinvoked reading
in the MMF from dir2 and outputting a new one to dir3, and so on. This process is illustrated by
Fig 8.7.

When performing embedded training, it is good practice to monitor the performance of the
models on unseen test data and stop training when no further improvement is obtained. Enabling
top level tracing by setting -T 1 will cause HERest to output the overall log likelihood per frame
of the training data. This measure could be used as a termination condition for repeated application
of HERest. However, repeated re-estimation to convergence may take an impossibly long time.
Worse still, it can lead to over-training since the models can become too closely matched to the
training data and fail to generalise well on unseen test data. Hence in practice around 2 to 5 cycles
of embedded re-estimation are normally sufficient when training phone models.

In order to get accurate acoustic models, a large amount of training data is needed. Several
hundred utterances are needed for speaker dependent recognition and several thousand are needed
for speaker independent recognition. In the latter case, a single iteration of embedded training
might take several hours to compute. There are two mechanisms for speeding up this computation.
Firstly, HERest has a pruning mechanism incorporated into its forward-backward computation.
HERest calculates the backward probabilities $\beta_j(t)$ first and then the forward probabilities $\alpha_j(t)$.
The full computation of these probabilities for all values of state $j$ and time $t$ is unnecessary since
many of these combinations will be highly improbable. On the forward pass, HERest restricts the
computation of the $\alpha$ values to just those for which the total log likelihood as determined by the
product $\alpha_j(t)\beta_j(t)$ is within a fixed distance from the total likelihood $P(O|M)$. This pruning is
always enabled since it is completely safe and causes no loss of modelling accuracy.

Pruning on the backward pass is also possible. However, in this case, the likelihood product
$\alpha_j(t)\beta_j(t)$ is unavailable since $\alpha_j(t)$ has yet to be computed, and hence a much broader beam must
be set to avoid pruning errors. Pruning on the backward pass is therefore under user control. It is
set using the -t option which has two forms. In the simplest case, a fixed pruning beam is set. For 
example, using -t 250.0 would set a fixed beam of 250.0. This method is adequate when there is 
sufficient compute time available to use a generously wide beam. When a narrower beam is used, 
HERest will reject any utterance for which the beam proves to be too narrow. This can be avoided 
by using an incremental threshold. For example, executing 



HERest -t 120.0 60.0 240.0 -S trainlist -I labs \
    -H dir1/hmacs -M dir2 hmmlist
would cause HERest to run normally at a beam width of 120.0. However, if a pruning error occurs,
the beam is increased by 60.0 and HERest reprocesses the offending training utterance. Repeated
errors cause the beam width to be increased again and this continues until either the utterance is
successfully processed or the upper beam limit is reached, in this case 240.0. Note that errors which
occur at very high beam widths are often caused by transcription errors. Hence, it is best not to set
the upper limit too high.
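The incremental threshold logic can be sketched in Python. This is not HERest's code: try_utterance is a hypothetical callback that returns True when an utterance is processed without a pruning error at the given beam width, and the sketch assumes that the upper limit itself is also tried before the utterance is rejected.

```python
def beam_schedule(initial, increment, limit):
    """List the beam widths tried in turn: initial, initial + increment, ... up to limit."""
    widths = []
    width = initial
    while width <= limit:
        widths.append(width)
        width += increment
    return widths


def process_with_incremental_beam(try_utterance, initial, increment, limit):
    """Return the first beam width at which try_utterance succeeds, or None.

    try_utterance(beam) is a hypothetical callback returning True on success
    and False when a pruning error occurs at that beam width.
    """
    for beam in beam_schedule(initial, increment, limit):
        if try_utterance(beam):
            return beam
    return None                        # utterance rejected at every beam width
```

With the settings 120.0, 60.0 and 240.0 used above, the schedule contains the widths 120.0, 180.0 and 240.0.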
Fig. 8.8 HERest Parallel Operation



The second way of speeding-up the operation of HERest is to use more than one computer in
parallel. The way that this is done is to divide the training data amongst the available machines and
then to run HERest on each machine such that each invocation of HERest uses the same initial
set of models but has its own private set of data. By setting the option -p N where N is an integer,
HERest will dump the contents of all its accumulators into a file called HERN.acc rather than
updating and outputting a new set of models. These dumped files are collected together and input
to a new invocation of HERest with the option -p 0 set. HERest then reloads the accumulators
from all of the dump files and updates the models in the normal way. This process is illustrated in
Figure 8.8.

To give a concrete example, suppose that four networked workstations were available to execute
the HERest command given earlier. The training files listed previously in trainlist would be
split into four equal sets and a list of the files in each set stored in trlist1, trlist2, trlist3,
and trlist4. On the first workstation, the command

HERest -S trlist1 -I labs -H dir1/hmacs -M dir2 -p 1 hmmlist

would be executed. This will load in the HMM definitions in dir1/hmacs, process the files listed
in trlist1 and finally dump its accumulators into a file called HER1.acc in the output directory
dir2. At the same time, the command



HERest -S trlist2 -I labs -H dir1/hmacs -M dir2 -p 2 hmmlist



would be executed on the second workstation, and so on. When HERest has finished on all four 
workstations, the following command will be executed on just one of them 

HERest -H dir1/hmacs -M dir2 -p 0 hmmlist dir2/*.acc

where the list of training files has been replaced by the dumped accumulator files. This will cause the
accumulated statistics to be reloaded and merged so that the model parameters can be reestimated
and the new model set output to dir2. The time to perform this last phase of the operation is
very small, hence the whole process will be around four times quicker than for the straightforward 
sequential case. 
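The reason the parallel scheme gives the same result as a sequential run is that the accumulators are sums, and sums can be merged in any order. The following Python sketch illustrates this with a deliberately simplified accumulator holding only a count and a total; the function names are hypothetical and the real accumulator files contain many more statistics.

```python
def split_list(items, parts):
    """Deal items into `parts` lists whose sizes differ by at most one."""
    return [items[i::parts] for i in range(parts)]


def accumulate(values):
    """Simplified accumulator for one machine: (count, sum of values)."""
    return len(values), sum(values)


def merge_accumulators(accs):
    """Add together the accumulators dumped by every machine."""
    count = sum(c for c, _ in accs)
    total = sum(t for _, t in accs)
    return count, total
```

Splitting the data four ways, accumulating each part separately and merging gives the same count and total as accumulating over all of the data at once, so the re-estimated mean is unchanged.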



8.6 Single-Pass Retraining 




In addition to re-estimating the parameters of a HMM set, HERest also provides a mechanism for 
mapping a set of models trained using one parameterisation into another set based on a different 
parameterisation. This facility allows the front-end of a HMM-based recogniser to be modified 
without having to rebuild the models from scratch. 

This facility is known as single-pass retraining. Given one set of well-trained models, a new 
set matching a different training data parameterisation can be generated in a single re-estimation 
pass. This is done by computing the forward and backward probabilities using the original models 
together with the original training data, but then switching to the new training data to compute
the parameter estimates for the new set of models.

Single-pass retraining is enabled in HERest by setting the -r switch. This causes the input
training files to be read in pairs. The first of each pair is used to compute the forward/backward
probabilities and the second is used to estimate the parameters for the new models. Very often, of
course, data input to HTK is modified by the HParm module in accordance with parameters set
in a configuration file. In single-pass retraining mode, configuration parameters can be prefixed by
the pseudo-module names HPARM1 and HPARM2. Then when reading in the first file of each pair, only
the HPARM1 parameters are used and when reading the second file of each pair, only the HPARM2
parameters are used.

As an example, suppose that a set of models has been trained on data with MFCC_E_D
parameterisation and a new set of models using Cepstral Mean Normalisation (_Z) is required. These two
data parameterisations are specified in a configuration file (config) as two separate instances of
the configuration variable TARGETKIND i.e.

# Single pass retraining
HPARM1: TARGETKIND = MFCC_E_D
HPARM2: TARGETKIND = MFCC_E_D_Z

HERest would then be invoked with the -r option set to enable single-pass retraining. For example,

HERest -r -C config -S trainList -I labs -H dir1/hmacs -M dir2 hmmList



The script file trainList contains a list of data file pairs. For each pair, the first file should match
the parameterisation of the original model set and the second file should match that of the required
new set. This will cause the model parameter estimates to be performed using the new set of
training data and a new set of models matching this data will be output to dir2. This process of
single-pass retraining is a significantly faster route to a new set of models than training a fresh set
from scratch.

8.7 Two-model Re-Estimation

Another method for initialisation of model parameters implemented by HERest is two-model
re-estimation. HMM sets often use the same basic units such as triphones but differ in the way the
underlying HMM parameters are tied. In these cases two-model re-estimation can be used to obtain
the state-level alignment using one model set which is used to update the parameters of a second
model set. This is helpful when the model set to be updated is less well trained.

A typical use of two-model re-estimation is the initialisation of state clustered triphone models.
In the standard case triphone models are obtained by cloning of monophone models and subsequent
clustering of triphone states. However, the unclustered triphone models are considerably less
powerful than state clustered triphone HMMs using mixtures of Gaussians. The consequence is poor state
level alignment and thus poor parameter estimates, prior to clustering. This can be ameliorated by
the use of well-trained alignment models for computing the forward-backward probabilities. In the
maximisation stage of the Baum-Welch algorithm the state level posteriors are used to re-estimate
the parameters of the update model set. Note that the corresponding models in the two sets must
have the same number of states.

As an example, suppose that we would like to update a set of cloned single Gaussian monophone
models in dir1/hmacs using the well trained state-clustered triphones in dir2/hmacs as alignment
models. Associated with each model set are the model lists hmmlist1 and hmmlist2 respectively.
In order to use the second model set for alignment a configuration file config.2model containing
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# alignment model set for two-model re-estimation 
ALIGNMODELMMF = dir2/hmacs 
ALIGNHMMLIST = hmmlist2 

is necessary. HERest only needs to be invoked using that configuration file. 

HERest -C config -C conf ig. 2model -S trainlist -I labs -H dirl/hmacs -M dir3 hmmlistl 

The models in directory dir1 are updated using the alignment models stored in directory dir2
and the result is written to directory dir3. Note that trainlist is a standard HTK script and
that the above command uses the capability of HERest to accept multiple configuration files on the
command line. If each HMM is stored in a separate file, the configuration variables ALIGNMODELDIR
and ALIGNMODELEXT can be used.

Only the state level alignment is obtained using the alignment models. In the exceptional case
that the update model set contains mixtures of Gaussians, component level posterior probabilities
are obtained from the update models themselves.



8.8 Parameter Re-estimation Formulae

For reference purposes, this section lists the various formulae employed within the HTK parameter
estimation tools. All are standard; however, the use of non-emitting states and multiple data
streams leads to various special cases which are usually not covered fully in the literature.
The following notation is used in this section

    $N$               number of states
    $S$               number of streams
    $M_s$             number of mixture components in stream s
    $T$               number of observations
    $Q$               number of models in an embedded training sequence
    $N_q$             number of states in the q'th model in a training sequence
    $O$               a sequence of observations
    $o_t$             the observation at time t, $1 \le t \le T$
    $o_{st}$          the observation vector for stream s at time t
    $a_{ij}$          the probability of a transition from state i to j
    $c_{jsm}$         weight of mixture component m in state j stream s
    $\mu_{jsm}$       vector of means for the mixture component m of state j stream s
    $\Sigma_{jsm}$    covariance matrix for the mixture component m of state j stream s
    $\lambda$         the set of all parameters defining a HMM

8.8.1 Viterbi Training (HInit)

In this style of model training, a set of training observations $O^r$, $1 \le r \le R$, is used to estimate the
parameters of a single HMM by iteratively computing Viterbi alignments. When used to initialise
a new HMM, the Viterbi segmentation is replaced by a uniform segmentation (i.e. each training
observation is divided into N equal segments) for the first iteration.

Apart from the first iteration on a new model, each training sequence $O$ is segmented using a
state alignment procedure which results from maximising

\[ \phi_N(T) = \max_i \phi_i(T)\, a_{iN} \]

for $1 < i < N$ where

\[ \phi_j(t) = \left[ \max_i \phi_i(t-1)\, a_{ij} \right] b_j(o_t) \]

with initial conditions given by

\[ \phi_1(1) = 1 \]
\[ \phi_j(1) = a_{1j}\, b_j(o_1) \]

for $1 < j < N$. In this and all subsequent cases, the output probability $b_j(\cdot)$ is as defined in
equations 7.1 and 7.2 in section 7.1.

If $A_{ij}$ represents the total number of transitions from state i to state j in performing the above
maximisations, then the transition probabilities can be estimated from the relative frequencies

\[ \hat{a}_{ij} = \frac{A_{ij}}{\sum_{k=2}^{N} A_{ik}} \]
The sequence of states which maximises $\phi_N(T)$ implies an alignment of training data
observations with states. Within each state, a further alignment of observations to mixture components is
made. The tool HInit provides two mechanisms for this: for each state and each stream

1. use clustering to allocate each observation $o_{st}$ to one of $M_s$ clusters, or

2. associate each observation $o_{st}$ with the mixture component with the highest probability

In either case, the net result is that every observation is associated with a single unique mixture
component. This association can be represented by the indicator function $\psi^r_{jsm}(t)$ which is 1 if $o^r_{st}$
is associated with mixture component m of stream s of state j and is zero otherwise.
The means and variances are then estimated via simple averages

\[ \hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)\, o^r_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)} \]

\[ \hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)\, (o^r_{st} - \hat{\mu}_{jsm})(o^r_{st} - \hat{\mu}_{jsm})'}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)} \]

Finally, the mixture weights are based on the number of observations allocated to each component

\[ c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \psi^r_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} \sum_{l=1}^{M_s} \psi^r_{jsl}(t)} \]
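The alignment step and the relative-frequency transition estimate can be illustrated with a small sketch. This is not HInit's implementation: the function names `viterbi_align` and `estimate_transitions`, the 0-based state indexing (state 0 is the entry state, the last state is the exit state) and the toy model values are all assumptions made for illustration, and probabilities are kept in the linear domain rather than the log domain for readability.

```python
def viterbi_align(a, b):
    """Return the best emitting-state sequence and its likelihood.

    a[i][j] is the transition probability from state i to state j, where
    state 0 is the non-emitting entry state and the last state is the
    non-emitting exit state. b[j][t] is the output probability of state j
    for frame t (rows for the entry and exit states are ignored).
    """
    n_states, n_frames = len(a), len(b[0])
    emitting = range(1, n_states - 1)
    phi = [[0.0] * n_frames for _ in range(n_states)]
    back = [[0] * n_frames for _ in range(n_states)]
    # Initial conditions: phi_j(1) = a_1j b_j(o_1)
    for j in emitting:
        phi[j][0] = a[0][j] * b[j][0]
    # Recursion: phi_j(t) = [max_i phi_i(t-1) a_ij] b_j(o_t)
    for t in range(1, n_frames):
        for j in emitting:
            best = max(emitting, key=lambda i: phi[i][t - 1] * a[i][j])
            back[j][t] = best
            phi[j][t] = phi[best][t - 1] * a[best][j] * b[j][t]
    # Termination: phi_N(T) = max_i phi_i(T) a_iN
    last = max(emitting, key=lambda i: phi[i][-1] * a[i][-1])
    path = [last]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[path[-1]][t])
    path.reverse()
    return path, phi[last][-1] * a[last][-1]


def estimate_transitions(paths, n_states):
    """Relative-frequency estimate: a_ij = A_ij / sum_k A_ik."""
    counts = [[0] * n_states for _ in range(n_states)]
    for path in paths:
        full = [0] + path + [n_states - 1]  # add entry and exit states
        for i, j in zip(full, full[1:]):
            counts[i][j] += 1
    estimates = []
    for row in counts:
        total = sum(row)
        estimates.append([c / total if total else 0.0 for c in row])
    return estimates


# Toy left-to-right model with two emitting states and four frames.
A = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
B = [[0.0, 0.0, 0.0, 0.0],
     [0.9, 0.8, 0.1, 0.1],
     [0.1, 0.2, 0.9, 0.9],
     [0.0, 0.0, 0.0, 0.0]]
path, likelihood = viterbi_align(A, B)
a_hat = estimate_transitions([path], len(A))
```

For this toy input the first two frames align to the first emitting state and the last two to the second, so each emitting state makes one self-transition and one exit transition.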

8.8.2 Forward/Backward Probabilities

Baum-Welch training is similar to the Viterbi training described in the previous section except
that the hard boundary implied by the $\psi$ function is replaced by a soft boundary function $L$
which represents the probability of an observation being associated with any given Gaussian mixture
component. This occupation probability is computed from the forward and backward probabilities.

For the isolated-unit style of training, the forward probability $\alpha_j(t)$ for $1 < j < N$ and $1 < t \le T$
is calculated by the forward recursion

\[ \alpha_j(t) = \left[ \sum_{i=2}^{N-1} \alpha_i(t-1)\, a_{ij} \right] b_j(o_t) \]

with initial conditions given by

\[ \alpha_1(1) = 1 \]
\[ \alpha_j(1) = a_{1j}\, b_j(o_1) \]

for $1 < j < N$ and final condition given by

\[ \alpha_N(T) = \sum_{i=2}^{N-1} \alpha_i(T)\, a_{iN} \]

The backward probability $\beta_i(t)$ for $1 < i < N$ and $T > t \ge 1$ is calculated by the backward
recursion

\[ \beta_i(t) = \sum_{j=2}^{N-1} a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1) \]

with initial conditions given by

\[ \beta_i(T) = a_{iN} \]

for $1 < i < N$ and final condition given by

\[ \beta_1(1) = \sum_{j=2}^{N-1} a_{1j}\, b_j(o_1)\, \beta_j(1) \]
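As a check on the isolated-unit recursions, the following minimal sketch computes both sets of probabilities for a toy model. The function name `forward_backward`, the 0-based indexing and the model values are assumptions for illustration; a practical implementation would work with log probabilities or scaling to avoid underflow.

```python
def forward_backward(a, b):
    """Forward and backward probabilities for a single HMM.

    a[i][j]: transition probabilities; state 0 is the non-emitting entry
    state and the last state is the non-emitting exit state.
    b[j][t]: output probability of emitting state j for frame t.
    Returns (alpha, beta, p_forward, p_backward).
    """
    n_states, n_frames = len(a), len(b[0])
    emitting = range(1, n_states - 1)
    alpha = [[0.0] * n_frames for _ in range(n_states)]
    beta = [[0.0] * n_frames for _ in range(n_states)]

    # Forward pass: alpha_j(1) = a_1j b_j(o_1), then the recursion.
    for j in emitting:
        alpha[j][0] = a[0][j] * b[j][0]
    for t in range(1, n_frames):
        for j in emitting:
            alpha[j][t] = sum(alpha[i][t - 1] * a[i][j] for i in emitting) * b[j][t]
    p_forward = sum(alpha[i][-1] * a[i][-1] for i in emitting)  # alpha_N(T)

    # Backward pass: beta_i(T) = a_iN, then the recursion.
    for i in emitting:
        beta[i][-1] = a[i][-1]
    for t in range(n_frames - 2, -1, -1):
        for i in emitting:
            beta[i][t] = sum(a[i][j] * b[j][t + 1] * beta[j][t + 1] for j in emitting)
    p_backward = sum(a[0][j] * b[j][0] * beta[j][0] for j in emitting)  # beta_1(1)

    return alpha, beta, p_forward, p_backward


A = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.6, 0.4, 0.0],
     [0.0, 0.0, 0.7, 0.3],
     [0.0, 0.0, 0.0, 0.0]]
B = [[0.0, 0.0, 0.0],
     [0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6],
     [0.0, 0.0, 0.0]]
alpha, beta, p_forward, p_backward = forward_backward(A, B)

# State occupation probabilities L_j(t) = alpha_j(t) beta_j(t) / P.
occupancy = [[alpha[j][t] * beta[j][t] / p_forward for t in range(3)]
             for j in range(len(A))]
```

The two final conditions give the same total probability, and at every frame the occupation probabilities of the emitting states sum to one, which is a convenient sanity check when implementing the recursions.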

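In the case of embedded training where the HMM spanning the observations is a composite
constructed by concatenating Q subword models, it is assumed that at time t, the $\alpha$ and $\beta$ values
corresponding to the entry state and exit states of a HMM represent the forward and backward
probabilities at time $t - \Delta t$ and $t + \Delta t$, respectively, where $\Delta t$ is small. The equations for calculating
$\alpha$ and $\beta$ are then as follows.

For the forward probability, the initial conditions are established at time t = 1 as follows

\[ \alpha^{(q)}_1(1) = \begin{cases} 1 & \text{if } q = 1 \\ \alpha^{(q-1)}_1(1)\, a^{(q-1)}_{1 N_{q-1}} & \text{otherwise} \end{cases} \]

\[ \alpha^{(q)}_j(1) = a^{(q)}_{1j}\, b^{(q)}_j(o_1) \]

\[ \alpha^{(q)}_{N_q}(1) = \sum_{i=2}^{N_q - 1} \alpha^{(q)}_i(1)\, a^{(q)}_{i N_q} \]

where the superscript in parentheses refers to the index of the model in the sequence of concatenated
models. All unspecified values of $\alpha$ are zero. For time t > 1,

\[ \alpha^{(q)}_1(t) = \begin{cases} 0 & \text{if } q = 1 \\ \alpha^{(q-1)}_{N_{q-1}}(t-1) + \alpha^{(q-1)}_1(t)\, a^{(q-1)}_{1 N_{q-1}} & \text{otherwise} \end{cases} \]

\[ \alpha^{(q)}_j(t) = \left[ \alpha^{(q)}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q - 1} \alpha^{(q)}_i(t-1)\, a^{(q)}_{ij} \right] b^{(q)}_j(o_t) \]

\[ \alpha^{(q)}_{N_q}(t) = \sum_{i=2}^{N_q - 1} \alpha^{(q)}_i(t)\, a^{(q)}_{i N_q} \]

For the backward probability, the initial conditions are set at time t = T as follows

\[ \beta^{(q)}_{N_q}(T) = \begin{cases} 1 & \text{if } q = Q \\ \beta^{(q+1)}_{N_{q+1}}(T)\, a^{(q+1)}_{1 N_{q+1}} & \text{otherwise} \end{cases} \]

\[ \beta^{(q)}_i(T) = a^{(q)}_{i N_q}\, \beta^{(q)}_{N_q}(T) \]

\[ \beta^{(q)}_1(T) = \sum_{j=2}^{N_q - 1} a^{(q)}_{1j}\, b^{(q)}_j(o_T)\, \beta^{(q)}_j(T) \]

where once again, all unspecified $\beta$ values are zero. For time t < T,

\[ \beta^{(q)}_{N_q}(t) = \begin{cases} 0 & \text{if } q = Q \\ \beta^{(q+1)}_1(t+1) + \beta^{(q+1)}_{N_{q+1}}(t)\, a^{(q+1)}_{1 N_{q+1}} & \text{otherwise} \end{cases} \]

\[ \beta^{(q)}_i(t) = a^{(q)}_{i N_q}\, \beta^{(q)}_{N_q}(t) + \sum_{j=2}^{N_q - 1} a^{(q)}_{ij}\, b^{(q)}_j(o_{t+1})\, \beta^{(q)}_j(t+1) \]

\[ \beta^{(q)}_1(t) = \sum_{j=2}^{N_q - 1} a^{(q)}_{1j}\, b^{(q)}_j(o_t)\, \beta^{(q)}_j(t) \]

The total probability $P = \mathrm{prob}(O|\lambda)$ can be computed from either the forward or backward
probabilities

\[ P = \alpha_N(T) = \beta_1(1) \]

8.8.3 Single Model Reestimation (HRest)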

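In this style of model training, a set of training observations $O^r$, $1 \le r \le R$, is used to estimate the
parameters of a single HMM. The basic formula for the reestimation of the transition probabilities is

\[ \hat{a}_{ij} = \frac{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r - 1} \alpha^r_i(t)\, a_{ij}\, b_j(o^r_{t+1})\, \beta^r_j(t+1)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^r_i(t)\, \beta^r_i(t)} \]

where $1 < i < N$ and $1 < j < N$ and $P_r$ is the total probability $P = \mathrm{prob}(O^r|\lambda)$ of the r'th
observation. The transitions from the non-emitting entry state are reestimated by

\[ \hat{a}_{1j} = \frac{1}{R} \sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^r_j(1)\, \beta^r_j(1) \]

where $1 < j < N$ and the transitions from the emitting states to the final non-emitting exit state
are reestimated by

\[ \hat{a}_{iN} = \frac{\sum_{r=1}^{R} \frac{1}{P_r}\, \alpha^r_i(T)\, \beta^r_i(T)}{\sum_{r=1}^{R} \frac{1}{P_r} \sum_{t=1}^{T_r} \alpha^r_i(t)\, \beta^r_i(t)} \]

where $1 < i < N$.

For a HMM with $M_s$ mixture components in stream s, the means, covariances and mixture
weights for that stream are reestimated as follows. Firstly, the probability of occupying the m'th
mixture component in stream s at time t for the r'th observation is

\[ L^r_{jsm}(t) = \frac{1}{P_r}\, U^r_j(t)\, c_{jsm}\, b_{jsm}(o^r_{st})\, \beta^r_j(t)\, b^*_{js}(o^r_t) \]

where

\[ U^r_j(t) = \begin{cases} a_{1j} & \text{if } t = 1 \\ \sum_{i=2}^{N-1} \alpha^r_i(t-1)\, a_{ij} & \text{otherwise} \end{cases} \tag{8.1} \]

and

\[ b^*_{js}(o^r_t) = \prod_{k \ne s} b_{jk}(o^r_{kt}) \]

For single Gaussian streams, the probability of mixture component occupancy is equal to the
probability of state occupancy and hence it is more efficient in this case to use

\[ L^r_{jsm}(t) = L^r_j(t) = \frac{1}{P_r}\, \alpha_j(t)\, \beta_j(t) \]

Given the above definitions, the re-estimation formulae may now be expressed in terms of $L^r_{jsm}(t)$
as follows.

\[ \hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)\, o^r_{st}}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)} \]

\[ \hat{\Sigma}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)\, (o^r_{st} - \hat{\mu}_{jsm})(o^r_{st} - \hat{\mu}_{jsm})'}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)} \tag{8.2} \]

\[ c_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_j(t)} \]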
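The mean and variance updates are occupancy-weighted averages, which the following sketch makes concrete for one mixture component. The function name `reestimate_gaussian`, the restriction to diagonal covariances and the toy numbers are assumptions for illustration; the occupation probabilities are taken as given rather than computed from a model.

```python
def reestimate_gaussian(observations, occupancies):
    """Occupancy-weighted mean and diagonal variance of one component.

    observations[r][t] is the observation vector of utterance r at frame t.
    occupancies[r][t] is the occupation probability L^r_jsm(t) of the
    component being updated for that frame.
    """
    dim = len(observations[0][0])
    pairs = [(o, L)
             for obs_r, occ_r in zip(observations, occupancies)
             for o, L in zip(obs_r, occ_r)]
    total = sum(L for _, L in pairs)
    mean = [sum(L * o[d] for o, L in pairs) / total for d in range(dim)]
    var = [sum(L * (o[d] - mean[d]) ** 2 for o, L in pairs) / total
           for d in range(dim)]
    return mean, var, total


# Two utterances of one-dimensional data (illustrative values).
obs = [[[1.0], [3.0]], [[5.0]]]
occ = [[1.0, 0.5], [0.5]]
mean, var, total = reestimate_gaussian(obs, occ)
```

The returned total occupancy is the numerator of the mixture weight update; dividing it by the summed state occupancy gives the re-estimated weight.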

8.8.4 Embedded Model Reestimation (HERest)

The re-estimation formulae for the embedded model case have to be modified to take account of the
fact that the entry states can be occupied at any time as a result of transitions out of the previous
model. The basic formula for the re-estimation of the transition probabilities is



The transitions from the non-emitting entry states into the HMM are re-estimated by



and the transitions out of the HMM into the non-emitting exit states are re-estimated by



Finally, the direct transitions from non-emitting entry to non-emitting exit states are re-estimated
by



The re-estimation formulae for the output distributions are the same as for the single model
case except for the obvious additional subscript for q. However, the probability calculations must
now allow for transitions from the entry states by changing $U^r_j(t)$ in equation 8.1 to

\[ U^{(q)r}_j(t) = \begin{cases} \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} & \text{if } t = 1 \\ \alpha^{(q)r}_1(t)\, a^{(q)}_{1j} + \sum_{i=2}^{N_q - 1} \alpha^{(q)r}_i(t-1)\, a^{(q)}_{ij} & \text{otherwise} \end{cases} \]

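8.8.5 Semi-Tied Transform Estimation (HERest)

In addition to estimating the standard parameters above, HERest can be used to estimate
semi-tied transforms and HLDA projections. This section describes semi-tied transforms; the updates
for HLDA are very similar.

Semi-tied covariance matrices have the form

\[ \mu_{m_r} = \mu_{m_r}, \qquad \Sigma_{m_r} = H_r\, \Sigma^{diag}_{m_r}\, H_r^T \tag{8.3} \]

For efficiency reasons the transforms are stored and likelihoods calculated using

\[ \mathcal{N}(o; \mu_{m_r}, H_r \Sigma^{diag}_{m_r} H_r^T) = \frac{1}{|H_r|}\, \mathcal{N}(H_r^{-1} o; H_r^{-1} \mu_{m_r}, \Sigma^{diag}_{m_r}) = |A_r|\, \mathcal{N}(A_r o; A_r \mu_{m_r}, \Sigma^{diag}_{m_r}) \tag{8.4} \]

where $A_r = H_r^{-1}$. The transformed mean, $A_r \mu_{m_r}$, is stored in the model files rather than the
original mean for efficiency.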
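The identity in equation 8.4 can be checked numerically. The sketch below does so for a two-dimensional case with illustrative values; the helper names (`mat_vec`, `det2`, `inv2`, `gauss_full2`, `gauss_diag`) are hypothetical and not part of HTK.

```python
import math


def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d], [-m[1][0] / d, m[0][0] / d]]


def gauss_full2(o, mu, cov):
    """Two-dimensional Gaussian density with a full covariance matrix."""
    diff = [o[0] - mu[0], o[1] - mu[1]]
    z = mat_vec(inv2(cov), diff)
    quad = diff[0] * z[0] + diff[1] * z[1]
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det2(cov)))


def gauss_diag(o, mu, var):
    """Gaussian density with a diagonal covariance matrix."""
    p = 1.0
    for o_d, mu_d, var_d in zip(o, mu, var):
        p *= math.exp(-0.5 * (o_d - mu_d) ** 2 / var_d) / math.sqrt(2.0 * math.pi * var_d)
    return p


H = [[1.2, 0.3], [-0.1, 0.8]]   # semi-tied transform H_r (illustrative)
var = [0.5, 2.0]                # diagonal variances of the component
mu = [1.0, -1.0]
o = [0.4, 0.2]

# Left-hand side: full covariance H diag(var) H^T evaluated directly.
cov = [[sum(H[i][k] * var[k] * H[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
direct = gauss_full2(o, mu, cov)

# Right-hand side: transform the observation and mean with A = H^-1,
# evaluate a diagonal Gaussian and scale by |A|.
A = inv2(H)
efficient = abs(det2(A)) * gauss_diag(mat_vec(A, o), mat_vec(A, mu), var)
```

The right-hand form needs only one matrix-vector product per transform and a diagonal Gaussian per component, which is why the transformed mean is the quantity worth storing.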

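The estimation of semi-tied transforms is a doubly iterative process. Given a current set of
covariance matrix estimates the semi-tied transforms are estimated in a similar fashion to the full
variance MLLRCOV transforms.

\[ a_{ri} = c_{ri}\, G_r^{(i)-1} \sqrt{\frac{\beta}{c_{ri}\, G_r^{(i)-1}\, c_{ri}^T}} \tag{8.5} \]

where $a_{ri}$ is the i'th row of $A_r$, the 1 x n row vector $c_{ri}$ is the vector of cofactors of $A_r$, $c_{rij} = \mathrm{cof}(A_{rij})$,
and $G_r^{(i)}$ is defined as

\[ G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^{diag\,2}_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t)\, (o(t) - \mu_{m_r})(o(t) - \mu_{m_r})^T \]

This iteratively estimates one row of the transform at a time. The number of iterations is controlled
by the HAdapt configuration variable MAXXFORMITER.

Having estimated the transform the diagonal covariance matrix is updated as

\[ \Sigma^{diag}_{m_r} = \mathrm{diag}\left( A_r\, \frac{\sum_{t=1}^{T} L_{m_r}(t)\, (o(t) - \mu_{m_r})(o(t) - \mu_{m_r})^T}{\sum_{t=1}^{T} L_{m_r}(t)}\, A_r^T \right) \]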

This is the second loop, as given a new estimate of the diagonal variance a new transform can be
estimated. The number of iterations of transform and covariance matrix update is controlled by
the HAdapt configuration variable MAXSEMITIEDITER.

Chapter 9

HMM Adaptation

Chapter 8 described how the parameters are estimated for plain continuous density HMMs within
HTK, primarily using the embedded training tool HERest. Using the training strategy depicted
in figure 8.2, together with other techniques, can produce high performance speaker independent
acoustic models for a large vocabulary recognition system. However it is possible to build improved
acoustic models by tailoring a model set to a specific speaker. By collecting data from a speaker
and training a model set on this speaker's data alone, the speaker's characteristics can be modelled
more accurately. Such systems are commonly known as speaker dependent systems, and on a typical
word recognition task, may have half the errors of a speaker independent system. The drawback
of speaker dependent systems is that a large amount of data (typically hours) must be collected in
order to obtain sufficient model accuracy.

Rather than training speaker dependent models, adaptation techniques can be applied. In this
case, by using only a small amount of data from a new speaker, a good speaker independent system
model set can be adapted to better fit the characteristics of this new speaker.

Speaker adaptation techniques can be used in various different modes. If the true transcription of 
the adaptation data is known then it is termed supervised adaptation, whereas if the adaptation data 
is unlabelled then it is termed unsupervised adaptation. In the case where all the adaptation data 
is available in one block, e.g. from a speaker enrollment session, then this is termed static adaptation.
Alternatively adaptation can proceed incrementally as adaptation data becomes available, and this 
is termed incremental adaptation. 

HTK provides two tools to adapt continuous density HMMs. HERest performs offline
supervised adaptation using various forms of linear transformation and/or maximum a-posteriori (MAP)
adaptation, while unsupervised adaptation is supported by HVite (using only linear
transformations). In this case HVite not only performs recognition, but simultaneously adapts the model set
as the data becomes available through recognition. Currently, linear transformation adaptation can
be applied in both incremental and static modes while MAP supports only static adaptation.

This chapter describes the operation of supervised adaptation with the HERest tools. The 
first sections of the chapter give an overview of linear transformation schemes and MAP adaptation 
and this is followed by a section describing the general usages of HERest to build simple and more 
complex adapted systems. The chapter concludes with a section detailing the various formulae used 
by the adaptation tool. The use of HVite to perform unsupervised adaptation is described in the 
RM Demo. 




9.1 Model Adaptation using Linear Transformations

9.1.1 Linear Transformations

This section briefly discusses the forms of transform available. Note that this form of adaptation is
only available with diagonal continuous density HMMs.

The transformation matrices are all obtained by solving a maximisation problem using the
Expectation-Maximisation (EM) technique. Using EM results in the maximisation of a standard
auxiliary function. (Full details are available in section 9.4.)

Maximum Likelihood Linear Regression (MLLRMEAN)

Maximum likelihood linear regression or MLLR computes a set of transformations that will reduce
the mismatch between an initial model set and the adaptation data^1. More specifically MLLR
is a model adaptation technique that estimates a set of linear transformations for the mean and
variance parameters of a Gaussian mixture HMM system. The effect of these transformations is to
shift the component means and alter the variances in the initial system so that each state in the
HMM system is more likely to generate the adaptation data.

The transformation matrix used to give a new estimate of the adapted mean is given by

\[ \hat{\mu} = W \xi \tag{9.1} \]

where $W$ is the n x (n + 1) transformation matrix (where n is the dimensionality of the data) and
$\xi$ is the extended mean vector,

\[ \xi = [\, w \;\; \mu_1 \;\; \mu_2 \;\; \ldots \;\; \mu_n \,]^T \]

where $w$ represents a bias offset whose value is fixed (within HTK) at 1.
Hence $W$ can be decomposed into

\[ W = [\, b \;\; A \,] \tag{9.2} \]

where $A$ represents an n x n transformation matrix and $b$ represents a bias vector. This form of
transform is referred to in the code as MLLRMEAN.
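A minimal sketch of applying such a transform, assuming the decomposition of equation 9.2, is given below. The function name `mllr_adapt_mean` and the numerical values are illustrative only.

```python
def mllr_adapt_mean(W, mu):
    """Apply an n x (n+1) MLLR mean transform W = [b A] to a mean vector.

    The mean is extended with a leading bias offset of 1, so the result
    equals A mu + b.
    """
    xi = [1.0] + list(mu)  # extended mean vector
    return [sum(w * x for w, x in zip(row, xi)) for row in W]


b = [0.5, -0.2]                    # bias vector (illustrative)
A = [[1.0, 0.1], [0.0, 0.9]]       # n x n transformation matrix
W = [[b_i] + row for b_i, row in zip(b, A)]
mu_hat = mllr_adapt_mean(W, [2.0, 3.0])
```

Storing the bias as the first column of W means one matrix-vector product adapts each mean, and the same W can be shared by every component in a regression class.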

Variance MLLR (MLLRVAR and MLLRCOV)

There are two standard forms of linear adaptation of the variances. The first is of the form

\[ \hat{\Sigma} = B^T H B \]

where $H$ is the linear transformation to be estimated and $B$ is the inverse of the Choleski factor
of $\Sigma^{-1}$, so

\[ \Sigma^{-1} = C C^T \]

and

\[ B = C^{-1} \]

This form of transform results in an effective full covariance matrix if the transform matrix $H$ is
full. This makes likelihood calculations highly inefficient. This form of transform is only available
with a diagonal transform and in conjunction with estimating an MLLR transform. The MLLR
transform is used as a parent transform for estimating $H$. This form of transform is referred to in
the code as MLLRVAR.



^1 MLLR can also be used to perform environmental compensation by reducing the mismatch due to channel or
additive noise effects.

An alternative more efficient form of variance transformation is also available. Here, the
transformation of the covariance matrix is of the form

\[ \hat{\Sigma} = H \Sigma H^T, \tag{9.3} \]

where $H$ is the n x n covariance transformation matrix. This form of transformation, referred to
in the code as MLLRCOV, can be efficiently implemented as a transformation of the means and the
features.

\[ \mathcal{N}(o; \mu, H \Sigma H^T) = \frac{1}{|H|}\, \mathcal{N}(H^{-1} o; H^{-1} \mu, \Sigma) = |A|\, \mathcal{N}(A o; A \mu, \Sigma) \tag{9.4} \]

where $A = H^{-1}$. Using this form it is possible to estimate and efficiently apply full transformations.
MLLRCOV transformations are normally estimated using MLLRMEAN transformations as the parent
transform.

Constrained MLLR (CMLLR)

Constrained maximum likelihood linear regression or CMLLR computes a set of transformations
that will reduce the mismatch between an initial model set and the adaptation data^2. More
specifically CMLLR is a feature adaptation technique that estimates a set of linear transformations for
the features. The effect of these transformations is to shift the feature vector in the initial system
so that each state in the HMM system is more likely to generate the adaptation data. Note that
due to computational reasons, CMLLR is only implemented within HTK for diagonal covariance,
continuous density HMMs.

The transformation matrix used to give a new estimate of the adapted feature vector is given by

\[ \hat{o} = W \zeta \tag{9.5} \]

where $W$ is the n x (n + 1) transformation matrix (where n is the dimensionality of the data) and
$\zeta$ is the extended observation vector,

\[ \zeta = [\, w \;\; o_1 \;\; o_2 \;\; \ldots \;\; o_n \,]^T \]

where $w$ represents a bias offset whose value is fixed (within HTK) at 1.
Hence $W$ can be decomposed into

\[ W = [\, b \;\; A \,] \tag{9.6} \]

where $A$ represents an n x n transformation matrix and $b$ represents a bias vector. This form of
transform is referred to in the code as CMLLR.

Since multiple CMLLR transforms may be used it is important to include the Jacobian in the
likelihood calculation.

\[ \mathcal{L}(o; \mu, \Sigma, A, b) = |A|\, \mathcal{N}(A o + b; \mu, \Sigma) \tag{9.7} \]

This is the implementation used in the code.
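The role of the Jacobian term can be seen in a short sketch of equation 9.7 in the log domain. The names `log_abs_det` and `cmllr_log_likelihood` are hypothetical and this is not the HTK code; diagonal covariances are assumed, as in the text.

```python
import math


def log_abs_det(m):
    """log|det m| by Gaussian elimination with partial pivoting."""
    a = [list(row) for row in m]
    n = len(a)
    total = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            raise ValueError("singular transform")
        a[col], a[pivot] = a[pivot], a[col]
        total += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return total


def cmllr_log_likelihood(o, mu, var, A, b):
    """log( |A| N(A o + b; mu, diag(var)) ) for one Gaussian component."""
    x = [sum(a_ij * o_j for a_ij, o_j in zip(row, o)) + b_i
         for row, b_i in zip(A, b)]
    log_gauss = sum(-0.5 * (math.log(2.0 * math.pi * v) + (x_d - m) ** 2 / v)
                    for x_d, m, v in zip(x, mu, var))
    return log_abs_det(A) + log_gauss


# One-dimensional check: transforming the feature by (a, b) is equivalent
# to a Gaussian with mean (mu - b) / a and variance var / a^2.
feature_space = cmllr_log_likelihood([0.5], [3.0], [4.0], [[2.0]], [1.0])
model_space = -0.5 * (math.log(2.0 * math.pi * 1.0) + (0.5 - 1.0) ** 2 / 1.0)
```

Without the log-determinant term the score would not be a properly normalised density of the original observation, so likelihoods computed under different transforms could not be compared.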

9.1.2 Input/Output/Parent Transformations

There are three types of linear transform that may be used with the HTK tools.

• Input transform: the input transform is used to determine the forward-backward probabilities,
hence the component posteriors, for estimating model and transform parameters. MLLR
transforms can be iteratively estimated by refining the posteriors using a newly estimated
transform.

• Output transform: the output transform is the transform that is generated. The form of the
transform is specified using the appropriate configuration options.

• Parent transform: the parent transform determines the model, or features, on which the
model set or transform is to be generated. For transform estimation this allows cascades of
transforms to be used to adapt the model parameters. For model estimation this supports
speaker adaptive training. Note the current implementation only supports adaptive training
with CMLLR. Any parent transform can be used when generating transforms.

There is no difference in the storage of the transform parameters, whether it is to be a parent
transform or an input transform. There are also no restrictions on the base classes, or regression
classes, that are used for each transform.



^2 MLLR can also be used to perform environmental compensation by reducing the mismatch due to channel or
additive noise effects.

9.1.3 Base Class Definitions

The first requirement to allow adaptation is to specify the set of the components that share the
same transform. This is achieved using a baseclass. The baseclass definition file uses the same
syntax for defining components as the HHEd command. However, for baseclass definitions the
components must always be specified.

    ~b "global"
    <MMFIDMASK> CUED_WSJ*
    <PARAMETERS> MIXBASE
    <NUMCLASSES> 1
    <CLASS> 1 {*.state[2-4].mix[1-12]}

    Figure 9.1: Global base class definition

The simplest form of transform uses a global transformation for all components. Figure 9.1
shows a global transformation for a system where there are up to 3 emitting states and up to 12
Gaussian components per state.

    ~b "baseclass_4.base"
    <MMFIDMASK> CUED_WSJ*
    <PARAMETERS> MIXBASE
    <NUMCLASSES> 4
    <CLASS> 1 {(one,sil).state[2-4].mix[1-12]}
    <CLASS> 2 {two.state[2-4].mix[1-12]}
    <CLASS> 3 {three.state[2-4].mix[1-12]}
    <CLASS> 4 {four.state[2-4].mix[1-12]}

    Figure 9.2: Four base classes definition

These baseclasses may be directly used to determine which components share a particular
transform. However a more general approach is to use a regression class tree.

9.1.4 Regression Class Trees

To improve the flexibility of the adaptation process it is possible to determine the appropriate set
of baseclasses depending on the amount of adaptation data that is available. If a small amount
of data is available then a global adaptation transform can be generated. A global transform (as
its name suggests) is applied to every Gaussian component in the model set. However as more
adaptation data becomes available, improved adaptation is possible by increasing the number of
transformations. Each transformation is now more specific and applied to certain groupings of
Gaussian components. For instance the Gaussian components could be grouped into the broad
phone classes: silence, vowels, stops, glides, nasals, fricatives, etc. The adaptation data could now
be used to construct more specific broad class transforms to apply to these groupings.

Rather than specifying static component groupings or classes, a robust and dynamic method
is used for the construction of further transformations as more adaptation data becomes available.
MLLR makes use of a regression class tree to group the Gaussians in the model set, so that the set
of transformations to be estimated can be chosen according to the amount and type of adaptation
data that is available. The tying of each transformation across a number of mixture components
makes it possible to adapt distributions for which there were no observations at all. With this
process all models can be adapted and the adaptation process is dynamically refined when more
adaptation data becomes available.

The regression class tree is constructed so as to cluster together components that are close in 
acoustic space, so that similar components can be transformed in a similar way. Note that the 
tree is built using the original speaker independent model set, and is thus independent of any new 
speaker. The tree is constructed with a centroid splitting algorithm, which uses a Euclidean distance 
measure. For more details see section 10.7. The terminal nodes or leaves of the tree specify the final 
component groupings, and are termed the base (regression) classes. Each Gaussian component of 
a model set belongs to one particular base class. The tool HHEd can be used to build a binary
regression class tree, and to label each component with a base class number. Both the tree and
component base class numbers can be saved as part of the MMF, or simply stored separately. Please
refer to section 10.7 for further details.

    ~r "regtree_4.tree"
    <BASECLASS>~b "baseclass_4.base"
    <NODE> 1 2 2 3
    <NODE> 2 2 4 5
    <NODE> 3 2 6 7
    <TNODE> 4 1 1
    <TNODE> 5 1 2
    <TNODE> 6 1 3
    <TNODE> 7 1 4

    Figure 9.3: Regression class tree example




[Figure: a binary regression tree]

Figure 9.1 shows a simple example of a binary regression tree with four base classes, denoted as
$\{C_4, C_5, C_6, C_7\}$. During "dynamic" adaptation, the occupation counts are accumulated for each of
the regression base classes. The diagram shows a solid arrow and circle (or node), indicating that
there is sufficient data for a transformation matrix to be generated using the data associated with
that class. A dotted line and circle indicates that there is insufficient data. For example neither
node 6 nor node 7 has sufficient data; however when pooled at node 3, there is sufficient adaptation data.
The amount of data that is "determined" as sufficient is set as a configuration option for HERest
(see reference section 17.7).

HERest uses a top-down approach to traverse the regression class tree. Here the search starts
at the root node and progresses down the tree generating transforms only for those nodes which

1. have sufficient data and

2. are either terminal nodes (i.e. base classes) or have any children without sufficient data.

In the example shown in figure 9.1, transforms are constructed only for regression nodes 2, 3
and 4, which can be denoted as $W_2$, $W_3$ and $W_4$. Hence when the transformed model set is
required, the transformation matrices (mean and variance) are applied in the following fashion to
the Gaussian components in each base class:

\[ W_2 \rightarrow \{C_5\}, \qquad W_3 \rightarrow \{C_6, C_7\}, \qquad W_4 \rightarrow \{C_4\} \]

At this point it is interesting to note that the global adaptation case is the same as a tree with 
just a root node, and is in fact treated as such. 
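The top-down selection rule can be sketched as follows. This is an illustration of the two conditions listed above rather than the HERest code: the function names, the dictionary representation of the tree and the occupation counts are all assumptions, chosen so that nodes 5, 6 and 7 fall below the threshold as in the example.

```python
def subtree_occupancy(tree, leaf_counts, node):
    """Total occupation count of all base classes below a node."""
    children = tree.get(node, ())
    if not children:
        return leaf_counts[node]
    return sum(subtree_occupancy(tree, leaf_counts, c) for c in children)


def select_transform_nodes(tree, leaf_counts, threshold, root=1):
    """Nodes with sufficient data that are leaves or have a starved child."""
    selected = []

    def visit(node):
        if subtree_occupancy(tree, leaf_counts, node) < threshold:
            return  # insufficient data: nothing at or below this node
        children = tree.get(node, ())
        starved = [c for c in children
                   if subtree_occupancy(tree, leaf_counts, c) < threshold]
        if not children or starved:
            selected.append(node)
        for child in children:
            visit(child)

    visit(root)
    return sorted(selected)


def assign_transforms(tree, selected, root=1):
    """Map each base class to its nearest selected node on the path from the root."""
    assignment = {}

    def visit(node, current):
        if node in selected:
            current = node
        children = tree.get(node, ())
        if not children:
            assignment[node] = current
        for child in children:
            visit(child, current)

    visit(root, None)
    return assignment


# Binary tree of the example: node -> children; leaves are base classes.
tree = {1: (2, 3), 2: (4, 5), 3: (6, 7)}
leaf_counts = {4: 1200.0, 5: 300.0, 6: 400.0, 7: 450.0}  # illustrative counts
nodes = select_transform_nodes(tree, leaf_counts, threshold=700.0)
assignment = assign_transforms(tree, nodes)
```

With these counts the selected nodes are 2, 3 and 4, and the base classes are assigned as in the example: class 4 uses its own transform, class 5 falls back to node 2, and classes 6 and 7 share node 3. In this sketch a base class is left unassigned (`None`) if even the root lacks sufficient data.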

An example of a regression class tree is shown in figure 9.3. This uses the four baseclasses from the
baseclass macro "baseclass_4.base". A binary regression tree is shown, thus there are 4 terminal
nodes.

9.1.5 Linear Transform Format

HERest estimates the required transformation statistics and can either output a set of
transformation models, or a single transform model file (TMF). The advantage in storing the transforms
as opposed to an adapted MMF is that the TMFs are considerably smaller than MMFs (especially
triphone MMFs). This section gives examples of the format that the transforms are stored in. For
a description of the transform definition see section 7.10.

    ~a "cued"
    <ADAPTKIND> BASE
    <BASECLASSES> ~b "global"
    <XFORMSET>
    <XFORMKIND> CMLLR
    <NUMXFORMS> 1
    <LINXFORM> 1 <VECSIZE> 5
    <OFFSET>
    <BIAS> 5
    -0.357 0.001 -0.002 0.132 0.072
    <LOGDET> -0.3419
    <BLOCKINFO> 2 3 2
    <BLOCK> 1
    <XFORM> 3 3
    0.942 -0.032 -0.001
    -0.102 0.922 -0.015
    -0.016 0.045 0.910
    <BLOCK> 2
    <XFORM> 2 2
    1.028 -0.032
    -0.017 1.041
    <XFORMWGTSET>
    <CLASSXFORM> 1 1

    Figure 9.4: Example Constrained MLLR transform using hard weights

Figure 9.4 shows the format of a single transform. In the same fashion as HMMs all transforms
are stored as macros. The header information gives how the transform was estimated, currently
either with a regression class tree TREE or directly using the base classes BASE. The base class macro is
then specified. The form of transformation is then described in the transformset. The code currently
supports constrained MLLR (illustrated), MLLR mean adaptation, MLLR full variance adaptation
and diagonal variance adaptation. Arbitrary block structures are allowable. The assignment of
base class to transform number is specified at the end of the file.

The LOGDET value stored with the transform is twice the log-determinant of the transform^3.

9.1.6 Hierarchy of Transforms

It is possible to specify a hierarchy of transformations. This results from using a parent transform
during the training process. Figure 9.5 shows the use of a set of MLLR transforms generated using
a parent CMLLR transform stored in the macro "cued". The action of this transform is

1. Apply transform cued

2. Apply transform mjfg

The parent transform is always applied before the transform itself.

A hierarchy of transforms automatically results from using a parent transform when estimating a
transform.

^3 There is no advantage in storing twice the log determinant, however this is maintained for backward compatibility
with internal HTK releases.

    ~a "mjfg"
    <ADAPTKIND> TREE
    <BASECLASSES> ~b "baseclass_4.base"
    <PARENTXFORM> ~a "cued"
    <XFORMSET>
    <XFORMKIND> MLLRMEAN
    <NUMXFORMS> 2
    <LINXFORM> 1 <VECSIZE> 5
    <OFFSET>
    <BIAS> 5
    -0.357 0.001 -0.002 0.132 0.072
    <BLOCKINFO> 2 3 2
    <BLOCK> 1
    <XFORM> 3 3
    0.942 -0.032 -0.001
    -0.102 0.922 -0.015
    -0.016 0.045 0.910
    <BLOCK> 2
    <XFORM> 2 2
    1.028 -0.032
    -0.017 1.041
    <LINXFORM> 2 <VECSIZE> 5
    <OFFSET>
    <BIAS> 5
    -0.357 0.001 -0.002 0.132 0.072
    <BLOCKINFO> 2 3 2
    <BLOCK> 1
    <XFORM> 3 3
    0.942 -0.032 -0.001
    -0.102 0.922 -0.015
    -0.016 0.045 0.910
    <BLOCK> 2
    <XFORM> 2 2
    1.028 -0.032
    -0.017 1.041
    <XFORMWGTSET>
    <CLASSXFORM> 1 1
    <CLASSXFORM> 2 1
    <CLASSXFORM> 3 1
    <CLASSXFORM> 4 2

    Figure 9.5: Example of an MLLR transform with a parent transform

9.1.7 Multiple Stream Systems

The specification of the base-class components is given in terms of the Gaussian component. In
HTK this is specified for a particular stream of the HMM state. When multiple streams are used
there are two situations to consider^4.

First, if the streams have the same number of components, then transforms may be shared
between different streams. For example it may be decided that the same linear transform is to be
used by the static stream, the delta stream and the delta-delta stream.

Second, the streams may have different dimensions associated with them. In this case the root
node is a special node for which a transform cannot be generated. It is required to partition the
Gaussian components so that all subsequent nodes have the same dimensionality associated with
them.



9.2 Adaptive Training with Linear Transforms

In order to improve the performance of systems when there are multiple speakers, or acoustic
environments, present in the training corpus, adaptive training may be used. Here, rather than
using adaptation transformations only during test, adaptation transforms are estimated for each
training speaker. The model, sometimes referred to as a canonical model, is then estimated given
the set of speaker transforms. In the same fashion as standard training, the whole process can then
be repeated.

In the current implementation, adaptive training is only supported with constrained MLLR
as the transform for each speaker. As CMLLR is implemented as one, or more, feature-space
transformations, the estimation formulae in section 8.8 are simply modified to accumulate
statistics using $A^{(i)} o + b^{(i)}$ for all the data from speaker i rather than $o$. The update formula for
$\mu_{jsm}$ then becomes

\[ \hat{\mu}_{jsm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)\, \left( A^{(i)} o^r_{st} + b^{(i)} \right)}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jsm}(t)} \]

where i denotes the speaker of the r'th observation sequence.

Specifying that adaptive training is to be used simply requires specifying the parent transform
that the model set should be built on. Note that usually the parent transform will also be used as
an input transform.

9.3 Model Adaptation using MAP

Model adaptation can also be accomplished using a maximum a posteriori (MAP) approach. This
adaptation process is sometimes referred to as Bayesian adaptation. MAP adaptation involves
the use of prior knowledge about the model parameter distribution. Hence, if we know what the
parameters of the model are likely to be (before observing any adaptation data) using the prior
knowledge, we might well be able to make good use of the limited adaptation data, to obtain
a decent MAP estimate. This type of prior is often termed an informative prior. Note that if
the prior distribution indicates no preference as to what the model parameters are likely to be (a
non-informative prior), then the MAP estimate obtained will be identical to that obtained using a
maximum likelihood approach.

For MAP adaptation purposes, the informative priors that are generally used are the speaker
independent model parameters. For mathematical tractability conjugate priors are used, which
results in a simple adaptation formula. The update formula for a single stream system for state j
and mixture component m is

\[ \hat{\mu}_{jm} = \frac{N_{jm}}{N_{jm} + \tau}\, \bar{\mu}_{jm} + \frac{\tau}{N_{jm} + \tau}\, \mu_{jm} \tag{9.8} \]

where $\tau$ is a weighting of the a priori knowledge to the adaptation speech data and $N_{jm}$ is the
occupation likelihood of the adaptation data, defined as,

\[ N_{jm} = \sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jm}(t) \]

where $\mu_{jm}$ is the speaker independent mean and $\bar{\mu}_{jm}$ is the mean of the observed adaptation
data and is defined as,

\[ \bar{\mu}_{jm} = \frac{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jm}(t)\, o^r_t}{\sum_{r=1}^{R} \sum_{t=1}^{T_r} L^r_{jm}(t)} \]

^4 The current code in HHEd for generating decision trees does not support generating trees for multiple streams.
However, the code does support adaptation for hand generated trees.

As can be seen, if the occupation likelihood of a Gaussian component ($N_{jm}$) is small, then the
mean MAP estimate will remain close to the speaker independent component mean. With MAP
adaptation, every single mean component in the system is updated with a MAP estimate, based
on the prior mean, the weighting and the adaptation data. Hence, MAP adaptation requires a new
"speaker-dependent" model set to be saved.

One obvious drawback to MAP adaptation is that it requires more adaptation data to be effective
when compared to MLLR, because MAP adaptation is specifically defined at the component level.
When larger amounts of adaptation training data become available, MAP begins to perform better
than MLLR, due to this detailed update of each component (rather than the pooled Gaussian
transformation approach of MLLR). In fact the two adaptation processes can be combined to
improve performance still further, by using the MLLR transformed means as the priors for MAP
adaptation (by replacing $\mu_{jm}$ in equation 9.8 with the transformed mean of equation 9.1). In this
case components that have a low occupation likelihood in the adaptation data (and hence would
not change much using MAP alone) have been adapted using a regression class transform in MLLR.
An example usage is shown in the following section.
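
As an illustration of the MAP mean update (equation 9.8), the following minimal Python sketch computes the estimate for one Gaussian component. The function name `map_mean`, its argument layout and the toy numbers are assumptions made for this example only; they are not part of HTK.

```python
def map_mean(prior_mean, frames, occupancies, tau):
    """MAP estimate of one Gaussian mean.

    prior_mean  : speaker independent mean (list of floats)
    frames      : adaptation observations (list of vectors)
    occupancies : occupation likelihood of this component for each frame
    tau         : weight given to the prior knowledge
    """
    n_jm = sum(occupancies)                      # occupation likelihood N_jm
    if n_jm == 0.0:
        return list(prior_mean)                  # no data seen: keep the prior mean
    dim = len(prior_mean)
    # occupancy-weighted mean of the observed adaptation data
    data_mean = [
        sum(l * o[d] for l, o in zip(occupancies, frames)) / n_jm
        for d in range(dim)
    ]
    w_data = n_jm / (n_jm + tau)
    w_prior = tau / (n_jm + tau)
    return [w_data * data_mean[d] + w_prior * prior_mean[d] for d in range(dim)]


prior = [0.0, 0.0]
frames = [[2.0, 4.0], [2.0, 4.0]]
few = map_mean(prior, frames, [0.5, 0.5], tau=10.0)      # N_jm = 1, stays near the prior
many = map_mean(prior, frames, [50.0, 50.0], tau=10.0)   # N_jm = 100, moves towards the data
```

With little occupancy the estimate stays close to the prior mean, and with `tau` set to zero it reduces to the plain weighted mean of the adaptation data.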

9.4 Linear Transformation Estimation Formulae

For reference purposes, this section lists the various formulae employed within the HTK adaptation
tool. It is assumed throughout that single stream data is used and that diagonal covariances are
also used. All are standard and can be found in various literature.
The following notation is used in this section

    $\mathcal{M}$           the model set
    $\hat{\mathcal{M}}$     the adapted model set
    $T$                     number of observations
    $m$                     a mixture component
    $O$                     a sequence of $d$-dimensional observations
    $o(t)$                  the observation at time $t$, $1 \le t \le T$
    $\zeta(t)$              extended observation at time $t$, $1 \le t \le T$
    $\mu_{m_r}$             mean vector for the mixture component $m_r$
    $\xi_{m_r}$             extended mean vector for the mixture component $m_r$
    $\Sigma_{m_r}$          covariance matrix for the mixture component $m_r$
    $L_{m_r}(t)$            the occupancy probability for the mixture component $m_r$ at time $t$

To enable robust transformations to be trained, the transform matrices are tied across a number
of Gaussians. The set of Gaussians which share a transform is referred to as a regression class. For
a particular transform case $W_r$, the $M_r$ Gaussian components $\{m_1, m_2, \ldots, m_{M_r}\}$ will be tied
together, as determined by the regression class tree (see section 9.1.4). The standard auxiliary
function shown below is used to estimate the transforms.

    \[ Q(\mathcal{M}, \hat{\mathcal{M}}) = -\frac{1}{2} \sum_{r=1}^{R} \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t) \left[ K^{(m)} + \log(|\hat{\Sigma}_{m_r}|) + (o(t) - \hat{\mu}_{m_r})^T \hat{\Sigma}_{m_r}^{-1} (o(t) - \hat{\mu}_{m_r}) \right] \]

where $K^{(m)}$ subsumes all constants and $L_{m_r}(t)$, the occupation likelihood, is defined as,

    \[ L_{m_r}(t) = p(q_{m_r}(t) \mid \mathcal{M}, O_T) \]

where $q_{m_r}(t)$ indicates the Gaussian component $m_r$ at time $t$, and $O_T = \{o(1), \ldots, o(T)\}$ is the
adaptation data. The occupation likelihood is obtained from the forward-backward process described in
section 8.8.
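9.4.1 Mean Transformation Matrix (MLLRMEAN)

Substituting the expressions for MLLR mean adaptation

    \[ \hat{\mu}_{m_r} = W_r \xi_{m_r}, \qquad \hat{\Sigma}_{m_r} = \Sigma_{m_r} \tag{9.9} \]

into the auxiliary function, and using the fact that the covariance matrices are diagonal, yields

    \[ Q(\mathcal{M}, \hat{\mathcal{M}}) = K - \frac{1}{2} \sum_{r=1}^{R} \sum_{j=1}^{d} \left( w_{rj} G_r^{(j)} w_{rj}^T - 2 w_{rj} k_r^{(j)T} \right) \]

where $w_{rj}$ is the $j$th row of $W_r$,

    \[ G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}} \xi_{m_r} \xi_{m_r}^T \sum_{t=1}^{T} L_{m_r}(t) \tag{9.10} \]

and

    \[ k_r^{(i)} = \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t) \frac{1}{\sigma^2_{m_r i}} o_i(t) \xi_{m_r}^T \tag{9.11} \]

Here $\sigma^2_{m_r i}$ is the $i$th diagonal element of $\Sigma_{m_r}$ and $o_i(t)$ is the $i$th element of $o(t)$.
Differentiating the auxiliary function with respect to the transform $W_r$, and then maximising it
with respect to the transformed mean yields the following update

    \[ w_{ri} = k_r^{(i)} G_r^{(i)-1} \tag{9.12} \]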

The above expressions assume that each base regression class $r$ has a separate transform. If
regression class trees are used then the shared transform parameters may be simply estimated by
combining the statistics of the base regression classes. The regression class tree is used to generate
the classes dynamically, so it is not known a-priori which regression classes will be used to estimate
the transform. This does not present a problem, since $G^{(i)}$ and $k^{(i)}$ for the chosen regression class
may be obtained from its child classes (as defined by the tree). If the parent node $R$ has children
$\{R_1, \ldots, R_C\}$ then

    \[ G_R^{(i)} = \sum_{c=1}^{C} G_{R_c}^{(i)} \]

and

    \[ k_R^{(i)} = \sum_{c=1}^{C} k_{R_c}^{(i)} \]

The same approach of combining statistics from multiple children can be applied to all the estimation
formulae in this section.
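
The accumulation of the MLLR mean statistics, their combination over child regression classes and the row update $w = k G^{-1}$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys (`mean`, `var`, `occ`, `obs`), the function names and the choice of 1 as the bias element of the extended mean vector are all hypothetical and are not HTK's internal data structures.

```python
def accumulate(components, row):
    """Return (G, k) for one regression class and one feature dimension `row`."""
    n = len(components[0]["mean"]) + 1
    g = [[0.0] * n for _ in range(n)]
    k = [0.0] * n
    for c in components:
        xi = [1.0] + list(c["mean"])                  # extended mean vector (bias first)
        inv_var = 1.0 / c["var"][row]                 # diagonal covariance assumed
        occ_sum = sum(c["occ"])
        obs_sum = sum(l * o[row] for l, o in zip(c["occ"], c["obs"]))
        for a in range(n):
            k[a] += inv_var * obs_sum * xi[a]
            for b in range(n):
                g[a][b] += inv_var * occ_sum * xi[a] * xi[b]
    return g, k


def combine(stats):
    """Sum the (G, k) statistics of several child regression classes."""
    n = len(stats[0][1])
    g = [[sum(s[0][a][b] for s in stats) for b in range(n)] for a in range(n)]
    k = [sum(s[1][a] for s in stats) for a in range(n)]
    return g, k


def solve_row(g, k):
    """Solve w G = k for the row vector w (G symmetric) by Gauss-Jordan elimination."""
    n = len(k)
    a = [g[i][:] + [k[i]] for i in range(n)]          # augmented matrix [G | k]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


# Two base classes, each holding one 1-dimensional Gaussian whose data obey o = 2*mean + 1
class_1 = [{"mean": [0.0], "var": [1.0], "occ": [1.0, 1.0], "obs": [[1.0], [1.0]]}]
class_2 = [{"mean": [1.0], "var": [1.0], "occ": [1.0, 1.0], "obs": [[3.0], [3.0]]}]
g_parent, k_parent = combine([accumulate(class_1, 0), accumulate(class_2, 0)])
w_row = solve_row(g_parent, k_parent)                 # [bias, scale]
```

In this toy case each base class alone gives a singular $G$ (a single mean cannot determine both bias and scale), whereas the summed parent statistics can be solved, which is the practical reason for pooling statistics up the tree.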

9.4.2 Variance Transformation Matrix (MLLRVAR, MLLRCOV)

Estimation of the first variance transformation matrices is only available for diagonal covariance
Gaussian systems in the current implementation, though full transforms can in theory be estimated.



The Gaussian covariance is transformed using*

    \[ \hat{\Sigma}_{m_r} = B_{m_r}^T H_m B_{m_r} \]

where $H_m$ is the linear transformation to be estimated and $B_{m_r}$ is the inverse of the Choleski factor
of $\Sigma_{m_r}^{-1}$, so

    \[ \Sigma_{m_r}^{-1} = C_{m_r} C_{m_r}^T \]

and

    \[ B_{m_r} = C_{m_r}^{-1} \]

*In the current implementation of the code this form of transform can only be estimated in addition to the
MLLRMEAN transform.
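After rewriting the auxiliary function, the transform matrix $H_m$ is estimated from,

    \[ H_m = \frac{\sum_{m_r=1}^{M_r} C_{m_r}^T \left[ \sum_{t=1}^{T} L_{m_r}(t) (o(t) - \hat{\mu}_{m_r})(o(t) - \hat{\mu}_{m_r})^T \right] C_{m_r}}{\sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t)} \]

Here, $H_m$ is forced to be a diagonal transformation by setting the off-diagonal terms to zero, which
ensures that $\hat{\Sigma}_{m_r}$ is also diagonal.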

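The alternative form of variance adaptation is supported for full, block and diagonal transforms.
Substituting the expressions for variance adaptation

    \[ \hat{\mu}_{m_r} = \mu_{m_r}, \qquad \hat{\Sigma}_{m_r} = H_r \Sigma_{m_r} H_r^T \tag{9.13} \]

into the auxiliary function, and using the fact that the covariance matrices are diagonal yields

    \[ Q(\mathcal{M}, \hat{\mathcal{M}}) = K + \sum_{r=1}^{R} \left[ \beta \log(c_{ri} a_{ri}^T) - \frac{1}{2} \sum_{j=1}^{n} a_{rj} G_r^{(j)} a_{rj}^T \right] \]

where

    \[ \beta = \sum_{m_r=1}^{M_r} \sum_{t=1}^{T} L_{m_r}(t) \tag{9.14} \]

    \[ A_r = H_r^{-1} \tag{9.15} \]

$a_{ri}$ is the $i$th row of $A_r$, the $1 \times n$ row vector $c_{ri}$ is the vector of cofactors of $A_r$, $c_{rij} = \mathrm{cof}(A_{rij})$, and
$G_r^{(i)}$ is defined as

    \[ G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t) (o(t) - \hat{\mu}_{m_r})(o(t) - \hat{\mu}_{m_r})^T \tag{9.16} \]

Differentiating the auxiliary function with respect to the transform $A_r$, and then maximising it
with respect to the transformed mean yields the following update

    \[ a_{ri} = c_{ri} G_r^{(i)-1} \sqrt{\frac{\beta}{c_{ri} G_r^{(i)-1} c_{ri}^T}} \tag{9.17} \]

This is an iterative optimisation scheme as the cofactors mean the estimate of row $i$ is dependent
on all the other rows (in that block). For the diagonal transform case it is of course non-iterative
and simplifies to the same form as the MLLRVAR transform.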

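9.4.3 Constrained MLLR Transformation Matrix (CMLLR)

Substituting the expressions for CMLLR adaptation where*

    \[ \hat{\mu}_{m_r} = H_r \mu_{m_r} + \tilde{b}_r, \qquad \hat{\Sigma}_{m_r} = H_r \Sigma_{m_r} H_r^T \tag{9.19} \]

into the auxiliary function, and using the fact that the covariance matrices are diagonal yields

    \[ Q(\mathcal{M}, \hat{\mathcal{M}}) = K + \sum_{r=1}^{R} \left[ \beta \log(p_{ri} w_{ri}^T) - \frac{1}{2} \sum_{j=1}^{n} \left( w_{rj} G_r^{(j)} w_{rj}^T - 2 w_{rj} k_r^{(j)T} \right) \right] \]

where

    \[ W_r = \left[ -A_r \tilde{b}_r \;\; H_r^{-1} \right] = \left[ b_r \;\; A_r \right] \tag{9.20} \]

*For efficiency this transformation is implemented as

    \[ \hat{o}_r(t) = A_r o(t) + b_r = W_r \zeta(t) \tag{9.18} \]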
$w_{ri}$ is the $i$th row of $W_r$, the $1 \times n$ row vector $p_{ri}$ is the zero extended vector of cofactors of $A_r$, and
$G_r^{(i)}$ and $k_r^{(i)}$ are defined as

    \[ G_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}} \sum_{t=1}^{T} L_{m_r}(t) \zeta(t) \zeta^T(t) \tag{9.21} \]

and

    \[ k_r^{(i)} = \sum_{m_r=1}^{M_r} \frac{1}{\sigma^2_{m_r i}} \mu_{m_r i} \sum_{t=1}^{T} L_{m_r}(t) \zeta^T(t) \tag{9.22} \]

Differentiating the auxiliary function with respect to the transform $W_r$, and then maximising it
with respect to the transformed mean yields the following update

    \[ w_{ri} = \left( \alpha p_{ri} + k_r^{(i)} \right) G_r^{(i)-1} \tag{9.23} \]

where $\alpha$ satisfies

    \[ \alpha^2 p_{ri} G_r^{(i)-1} p_{ri}^T + \alpha p_{ri} G_r^{(i)-1} k_r^{(i)T} - \beta = 0 \tag{9.24} \]

There are thus two possible solutions for $\alpha$. The solution that yields the maximum increase in the
auxiliary function (obtained by simply substituting in the two options) is used. This is an iterative
optimisation scheme as the cofactors mean the estimate of row $i$ is dependent on all the other rows
(in that block).
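
The choice between the two roots of the quadratic in $\alpha$ can be illustrated with a small Python sketch for 1-dimensional features, where the transform row is $w = [b, a]$ and the only cofactor is 1. All function names and the toy statistics are assumptions made for this example; the per-row auxiliary value used for the comparison is the part of the auxiliary function that depends on the row, written with an absolute value inside the logarithm.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def row_times(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def row_auxiliary(w, g, k, p, beta):
    """Part of the auxiliary function that depends on the row w."""
    return beta * math.log(abs(dot(p, w))) - 0.5 * dot(row_times(w, g), w) + dot(w, k)


def cmllr_row(g, k, p, beta):
    """Solve the quadratic for alpha and keep the root with the larger auxiliary value."""
    g_inv = inverse_2x2(g)
    pg = row_times(p, g_inv)
    a = dot(pg, p)                                   # coefficient of alpha^2
    b = dot(pg, k)                                   # coefficient of alpha
    root = math.sqrt(b * b + 4.0 * a * beta)
    candidates = []
    for alpha in ((-b + root) / (2.0 * a), (-b - root) / (2.0 * a)):
        w = row_times([alpha * pi + ki for pi, ki in zip(p, k)], g_inv)
        candidates.append((row_auxiliary(w, g, k, p, beta), w))
    return max(candidates, key=lambda c: c[0])[1]


# Two unit-variance Gaussians (means 0 and 2), one frame each with occupancy 1
frames = [(0.0, 1.0), (2.0, 3.0)]                    # (component mean, observation)
g = [[0.0, 0.0], [0.0, 0.0]]
k = [0.0, 0.0]
for mean, obs in frames:
    zeta = [1.0, obs]                                # extended observation, bias first
    for i in range(2):
        k[i] += mean * zeta[i]
        for j in range(2):
            g[i][j] += zeta[i] * zeta[j]
beta = float(len(frames))
p = [0.0, 1.0]                                       # zero extended cofactor vector
w = cmllr_row(g, k, p, beta)                         # [b, a]
```

For a full or block transform the cofactors change whenever another row changes, so this row update would be repeated over the rows until convergence; the 1-dimensional case needs only one pass.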

In chapter 8, the basic processes involved in training a continuous density HMM system were
explained and examples were given of building a set of HMM phone models. In the practical
application of these techniques to building real systems, there are often a number of problems to
overcome. Most of these arise from the conflicting desire to have a large number of model parameters
in order to achieve high accuracy, whilst at the same time having limited and uneven training data.

As mentioned previously, the HTK philosophy is to build systems incrementally. Starting with a
set of context-independent monophone HMMs, a system can be refined in a sequence of stages. Each
refinement step typically uses the HTK HMM definition editor HHEd followed by re-estimation
using HERest. These incremental manipulations of the HMM set often involve parameter tying,
thus many of HHEd's operations involve generating new macro definitions.

The principal types of manipulation that can be performed by HHEd are

• HMM cloning to form context-dependent model sets 

• Generalised parameter tying 

• Data driven and decision tree based clustering. 

• Mixture component splitting 

• Adding/removing state transitions 

• Stream splitting, resizing and recasting 
This chapter describes how the HTK tool HHEd is used, its editing language and the main
operations that can be performed.



10.1 Using HHEd 

The HMM editor HHEd takes as input a set of HMM definitions and outputs a new modified set,
usually to a new directory. It is invoked by a command line of the form

    HHEd -H MMF1 -H MMF2 ... -M newdir cmds.hed hmmlist

where cmds.hed is an edit script containing a list of edit commands. Each command is written on
a separate line and begins with a 2 letter command name.

The effect of executing the above command line would be to read in the HMMs listed in hmmlist
and defined by files MMF1, MMF2, etc., apply the editing operations defined in cmds.hed and then
write the resulting system out to the directory newdir. As with all tools, HTK will attempt
to replicate the file structure of the input in the output directory. By default, any new macros
generated by HHEd will be written to one or more of the existing MMFs. In doing this, HTK will
attempt to ensure that the "definition before use" rule for macros is preserved, but it cannot always
guarantee this. Hence, it is usually best to define explicit target file names for new macros. This
can be done in two ways. Firstly, explicit target file names can be given in the edit script using the
UF command. For example, if cmds.hed contained

    UF smacs
    # commands to generate state macros
    ....
    UF vmacs
    # commands to generate variance macros
    ....

then the output directory would contain an MMF called smacs containing a set of state macro
definitions and an MMF called vmacs containing a set of variance macro definitions. These would
be in addition to the existing MMF files MMF1, MMF2, etc.

Alternatively, the whole HMM system can be written to a single file using the -w option. For
example,

    HHEd -H MMF1 -H MMF2 ... -w newMMF cmds.hed hmmlist

would write the whole of the edited HMM set to the file newMMF.

As mentioned previously, each execution of HHEd is normally followed by re-estimation using
HERest. Normally, all the information needed by HHEd is contained in the model set itself.
However, some clustering operations require various statistics about the training data (see sections 10.4
and 10.5). These statistics are gathered by HERest and output to a stats file, which is then read in
by HHEd. Note, however, that the statistics file generated by HERest refers to the input model
set not the re-estimated set. Thus for example, in the following sequence, the HHEd edit script in
cmds.hed contains a command (see the RO command in section 10.4) which references a statistics
file (called stats) describing the HMM set defined by hmm1/MMF.

    HERest -H hmm1/MMF -M hmmx -s stats hmmlist train1 train2
    HHEd -H hmm1/MMF -M hmm2 cmds.hed hmmlist

The required statistics file is generated by HERest but the re-estimated model set stored in
hmmx/MMF is ignored and can be deleted.
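
When these two steps are driven from a script, it can help to build the argument lists programmatically. The Python sketch below does only that; the helper names `hhed_args` and `herest_stats_args` are hypothetical, and the only options used are those shown in this section (-H, -M, -w and -s). It does not run the tools.

```python
def hhed_args(mmfs, edit_script, hmm_list, out_dir=None, single_mmf=None):
    """Argument list for one HHEd run."""
    args = ["HHEd"]
    for mmf in mmfs:
        args += ["-H", mmf]                 # one -H per input MMF
    if out_dir is not None:
        args += ["-M", out_dir]             # directory for the edited model set
    if single_mmf is not None:
        args += ["-w", single_mmf]          # write the whole system to a single file
    return args + [edit_script, hmm_list]


def herest_stats_args(mmf, out_dir, stats_file, hmm_list, train_files):
    """Argument list for a HERest pass whose purpose is to produce a stats file."""
    return ["HERest", "-H", mmf, "-M", out_dir, "-s", stats_file, hmm_list] + list(train_files)


stats_run = herest_stats_args("hmm1/MMF", "hmmx", "stats", "hmmlist", ["train1", "train2"])
edit_run = hhed_args(["hmm1/MMF"], "cmds.hed", "hmmlist", out_dir="hmm2")
```

Both commands deliberately read the same input model set, hmm1/MMF, because the statistics describe the input models rather than the re-estimated ones. The lists could be passed to `subprocess.run` on a machine where HTK is installed.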

10.2 Constructing Context-Dependent Models 

The first stage of model refinement is usually to convert a set of initialised and trained
context-independent monophone HMMs to a set of context dependent models. As explained in section 6.4,
HTK uses the convention that a HMM name of the form l-p+r denotes the context-dependent
version of the phone p which is to be used when the left neighbour is the phone l and the right
neighbour is the phone r. To make a set of context dependent phone models, it is only necessary
to construct a HMM list, called say cdlist, containing the required context-dependent models and
then execute HHEd with a single command in its edit script

    CL cdlist

The effect of this command is that for each model l-p+r in cdlist it makes a copy of the monophone
p.

The set of context-dependent models output by the above must be re-estimated using HERest.
To do this, the training data transcriptions must be converted to use context-dependent labels
and the original monophone HMM list must be replaced by cdlist. In fact, it is best to do this
conversion before cloning the monophones because if the HLEd TC command is used then the -n
option can be used to generate the required list of context dependent HMMs automatically.

Before building a set of context-dependent models, it is necessary to decide whether or not
cross-word triphones are to be used. If they are, then word boundaries in the training data can
be ignored and all monophone labels can be converted to triphones. If, however, word internal
triphones are to be used, then word boundaries in the training transcriptions must be marked in
some way (either by an explicit marker which is subsequently deleted or by using a short pause
tee-model). This word boundary marker is then identified to HLEd using the WB command to make
the TC command use biphones rather than triphones at word boundaries (see section 6.4).
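
The difference between cross-word and word-internal labelling under the l-p+r naming convention can be sketched in Python as follows. This only illustrates the naming convention; it is not a description of how HLEd itself is implemented, and the function names and the example phone strings are assumptions.

```python
def to_triphones(phones):
    """Label each phone with its left and right neighbours (biphones at the ends)."""
    labels = []
    for i, p in enumerate(phones):
        name = p
        if i > 0:
            name = phones[i - 1] + "-" + name        # left context
        if i < len(phones) - 1:
            name = name + "+" + phones[i + 1]        # right context
        labels.append(name)
    return labels


def cross_word(words):
    """Ignore word boundaries: contexts run across the whole utterance."""
    return to_triphones([p for w in words for p in w])


def word_internal(words):
    """Contexts never cross a word boundary, so word edges become biphones."""
    return [label for w in words for label in to_triphones(w)]


utterance = [["b", "ih", "t"], ["ih", "t"]]
cw = cross_word(utterance)
wi = word_internal(utterance)
```

Here `cw` contains the cross-word triphones `ih-t+ih` and `t-ih+t`, whereas `wi` has the biphones `ih-t` and `ih+t` at the word boundary.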

All HTK tools can read and write HMM definitions in text or binary form. Text is good for
seeing exactly what the tools are producing, but binary is much faster to load and store, and
much more compact. Binary output is enabled either using the standard option -B or by setting
the configuration variable SAVEBINARY. In the above example, the HMM set input to HHEd will
contain a small set of monophones whereas the output will be a large set of triphones. In order to
save storage and computation, this is usually a good point to switch to binary storage of MMFs.


10.3 Parameter Tying and Item Lists

As explained in Chapter 7, HTK uses macros to support a generalised parameter tying facility.
Referring again to Fig. 7.7.8, each of the solid black circles denotes a potential tie-point in the
hierarchy of HMM parameters. When two or more parameter sets are tied, the same set of parameter
values is shared by all the owners of the tied set. Externally, tied parameters are represented by
macros and internally they are represented by structure sharing. The accumulators needed for the
numerators and denominators of the Baum-Welch re-estimation formulae given in section 8.8 are
attached directly to the parameters themselves. Hence, when the values of a tied parameter set
are re-estimated, all of the data which would have been used to estimate each individual untied
parameter are effectively pooled leading to more robust parameter estimation.

Note also that although parameter tying is implemented in a way which makes it transparent
to the HTK re-estimation and recognition tools, in practice, most tools do notice when a system
has been tied and try to take advantage of it by avoiding redundant computations.

Although macro definitions could be written by hand, in practice, tying is performed by executing
HHEd commands and the resulting macros are thus generated automatically. The basic HHEd
command for tying a set of parameters is the TI command which has the form

    TI macroname itemlist

This causes all items in the given itemlist to be tied together and output as a macro called
macroname. Macro names are written as a string of characters optionally enclosed in double quotes.
The latter are necessary if the name contains one or more characters which are not letters or digits.

Fig. 10.1 Item List Construction



Item lists use a simple language to identify sets of points in the HMM parameter hierarchy
illustrated in Fig. 7.7.8. This language is defined fully in the reference entry for HHEd. The
essential idea is that item lists represent paths down the hierarchical parameter tree where the
direction down should be regarded as travelling from the root of the tree towards the leaves. A
path can be unique, or more usually, it can be a pattern representing a set of paths down the tree.
The point at which each path stops identifies one member of the set represented by the item list.
Fig. 10.1 shows the possible paths down the tree. In text form the branches are replaced by dots
and the underlined node names are possible terminating points. At the topmost level, an item list
is a comma separated list of paths enclosed in braces.

Some examples should make all this clearer. Firstly, the following is a legal but somewhat
long-winded way of specifying the set of items comprising states 2, 3 and 4 of the HMM called aa

    { aa.state[2],aa.state[3],aa.state[4] }

however in practice this would be written much more compactly as

    { aa.state[2-4] }

It must be emphasised that indices in item lists are really patterns. The set represented by an item
list consists of all those elements which match the patterns. Thus, if aa only had two emitting
states, the above item list would not generate an error. It would simply only match two items. The
reason for this is that the same pattern can be applied to many different objects. For example, the
HMM name can be replaced by a list of names enclosed in brackets. Furthermore, each HMM name
can include '?' characters which match any single character and '*' characters which match zero or
more characters. Thus

    { (aa+*,iy+*,eh+*).state[2-4] }

represents states 2, 3 and 4 of all biphone models corresponding to the phonemes aa, iy and eh. If
aa had just 2 emitting states and the others had 4 emitting states, then this item list would include
2 states from each of the aa models and 3 states from each of the others. Moving further down the
tree, the item list

    { *.state[2-4].stream[1].mix[1,3].cov }

denotes the set of all covariance vectors (or matrices) of the first and third mixture components of
stream 1, of states 2 to 4 of all HMMs. Since many HMM systems are single stream, the stream
part of the path can be omitted if its value is 1. Thus, the above could have been written

    { *.state[2-4].mix[1,3].cov }

These last two examples also show that indices can be written as comma separated lists as well as
ranges, for example, [1,3,4-6,9] is a valid index list representing states 1, 3, 4, 5, 6, and 9.
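
The pattern behaviour of item lists (index ranges, '?' and '*' in HMM names, and silent skipping of states that do not exist) can be mimicked with a short Python sketch. This is not HHEd's parser: the functions `expand_indices` and `match_states`, and the dictionary mapping each HMM name to its emitting state numbers, are assumptions for illustration. Python's `fnmatchcase` is used because its '?' and '*' wildcards behave as described above for names without square brackets.

```python
from fnmatch import fnmatchcase


def expand_indices(spec):
    """Expand an index list such as '1,3,4-6,9' into [1, 3, 4, 5, 6, 9]."""
    indices = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            indices.extend(range(int(low), int(high) + 1))
        else:
            indices.append(int(part))
    return indices


def match_states(hmm_states, name_patterns, index_spec):
    """Return (hmm name, state index) pairs selected by the name patterns and indices."""
    wanted = set(expand_indices(index_spec))
    matched = []
    for name, states in hmm_states.items():
        if any(fnmatchcase(name, pattern) for pattern in name_patterns):
            # states that a model does not have are simply not matched
            matched.extend((name, s) for s in states if s in wanted)
    return matched


hmm_states = {
    "aa+b": [2, 3],            # only two emitting states
    "iy+t": [2, 3, 4, 5],
    "eh+d": [2, 3, 4, 5],
    "sil": [2, 3, 4],
}
selected = match_states(hmm_states, ["aa+*", "iy+*", "eh+*"], "2-4")
```

The result holds two states for the aa model and three for each of the others, and no error is raised for the missing state 4 of `aa+b`.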
When item lists are used as the argument to a TI command, the kind of items represented by
the list determines the macro type in a fairly obvious way. The only non-obvious cases are firstly
that lists ending in cov generate ~v, ~c, or ~x macros as appropriate. Secondly, if an explicit set of
mixture components is defined as in

    { *.state[2].mix[1-5] }

then ~m macros are generated but omitting the indices altogether denotes a special case of mixture
tying which is explained later in Chapter 11.

To illustrate the use of item lists, some example TI commands can now be given. Firstly, when
a set of context-dependent models is created, it can be beneficial to share one transition matrix
across all variants of a phone rather than having a distinct transition matrix for each. This could
be achieved by adding TI commands immediately after the CL command described in the previous
section, that is

    CL cdlist
    TI T_ah {*-ah+*.transP}
    TI T_eh {*-eh+*.transP}
    TI T_ae {*-ae+*.transP}
    TI T_ih {*-ih+*.transP}
    ... etc

As a second example, a so-called Grand Variance HMM system can be generated very easily
with the following HHEd command

    TI "gvar" { *.state[2-4].mix[1].cov }

where it is assumed that the HMMs are 3-state single mixture component models. The effect of
this command is to tie all state distributions to a single global variance vector. For applications
where there is limited training data, this technique can improve performance, particularly in noise.

Speech recognition systems will often have distinct models for silence and short pauses. A silence
model sil may have the normal 3 state topology whereas a short pause model may have just a
single state. To avoid the two models competing with each other, the sp model state can be tied to
the centre state of the sil model thus

    TI "silst" { sp.state[2], sil.state[3] }

So far nothing has been said about how the parameters are actually determined when a set
of items is replaced by a single shared representative. When states are tied, the state with the
broadest variances and as few as possible zero mixture component weights is selected from the pool
and used as the representative. When mean vectors are tied, the average of all the mean vectors
in the pool is used and when variances are tied, the largest variance in the pool is used. In all
other cases, the last item in the tie-list is arbitrarily chosen as representative. All of these selection
criteria are ad hoc, but since the tie operations are always followed by explicit re-estimation using
HERest, the precise choice of representative for a tied set is not critical.

Finally, tied parameters can be untied. For example, subsequent refinements of the
context-dependent model set generated above with tied transition matrices might result in a much more
compact set of models for which individual transition parameters could be robustly estimated. This
can be done using the UT command whose effect is to untie all of the items in its argument list. For
example, the command

    UT {*-iy+*.transP}

would untie the transition parameters in all variants of the iy phoneme. This untying works by 
simply making unique copies of the tied parameters. These untied parameters can then subsequently 
be re-estimated. 



10.4 Data-Driven Clustering 

In section 10.2, a method of triphone construction was described which involved cloning all
monophones and then re-estimating them using data for which monophone labels have been replaced by
triphone labels. This will lead to a very large set of models, and relatively little training data for
each model. Applying the argument that context will not greatly affect the centre states of triphone
models, one way to reduce the total number of parameters without significantly altering the models'
ability to represent the different contextual effects might be to tie all of the centre states across all
models derived from the same monophone. This tying could be done by writing an edit script of
the form

    TI "iyS3" {*-iy+*.state[3]}
    TI "ihS3" {*-ih+*.state[3]}
    TI "ehS3" {*-eh+*.state[3]}
    ....etc

Each TI command would tie all the centre states of all triphones in each phone group. Hence, if
there were an average of 100 triphones per phone group then the total number of states per group
would be reduced from 300 to 201.
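
The arithmetic behind the figures 300 and 201 is that each of the 100 triphones keeps its own left and right states, while all centre states collapse into a single shared one. A minimal Python check, with a hypothetical helper name, is:

```python
def states_per_group(n_triphones, emitting_states=3, tied_positions=1):
    """Distinct states in a phone group before and after tying some state positions."""
    untied = n_triphones * emitting_states
    # each tied position contributes one shared state instead of one per triphone
    tied = n_triphones * (emitting_states - tied_positions) + tied_positions
    return untied, tied


before, after = states_per_group(100)
```

This gives `before == 300` and `after == 201`, that is 200 untied outer states plus one tied centre state.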

Explicit tyings such as these can have some positive effect but overall they are not very
satisfactory. Tying all centre states is too severe and worse still, the problem of undertraining for the
left and right states remains. A much better approach is to use clustering to decide which states
to tie. HHEd provides two mechanisms for this. In this section a data-driven clustering approach
will be described and in the next section, an alternative decision tree-based approach is presented.

Data-driven clustering is performed by the TC and NC commands. These both invoke the same
bottom-up (agglomerative) hierarchical procedure. Initially all states are placed in individual clusters.
The pair of clusters which when combined would form the smallest resultant cluster are merged. This
process repeats until either the size of the largest cluster reaches the threshold set by the TC command or
the total number of clusters has fallen to that specified by the NC command. The size of a cluster
is defined as the greatest distance between any two states. The distance metric depends on the
type of state distribution. For single Gaussians, a weighted Euclidean distance between the means
is used and for tied-mixture systems a Euclidean distance between the mixture weights is used. For
all other cases, the average probability of each component mean with respect to the other state is
used. The details of the algorithm and these metrics are given in the reference section for HHEd.
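
The merging procedure just described can be sketched in Python as below. This is a simplified illustration rather than HHEd's implementation: it uses a plain (unweighted) Euclidean distance between state means, it stops when the next merge would exceed the threshold, and the names `cluster_size`, `cluster_states` and the toy state means are assumptions.

```python
import math


def cluster_size(members, means):
    """Size of a cluster: the greatest distance between any two of its states."""
    return max(
        (math.dist(means[a], means[b])
         for i, a in enumerate(members) for b in members[i + 1:]),
        default=0.0,
    )


def cluster_states(means, threshold=None, n_clusters=None):
    """Merge states pairwise, TC style (threshold) or NC style (n_clusters)."""
    clusters = [[name] for name in means]            # every state starts alone
    while len(clusters) > 1:
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        # find the pair whose merged cluster would be smallest
        size, i, j = min(
            (cluster_size(clusters[i] + clusters[j], means), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if threshold is not None and size > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


state_means = {"s1": [0.0], "s2": [0.1], "s3": [5.0], "s4": [5.2]}
by_threshold = cluster_states(state_means, threshold=1.0)
by_count = cluster_states(state_means, n_clusters=3)
```

With a threshold of 1.0 the four states fall into two clusters, while asking for three clusters stops after the first merge.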
Fig. 10.2 Data-driven state tying

As an example, the following HHEd script would cluster and tie the corresponding states of the
triphone group for the phone ih

    TC 100.0 "ihS2" {*-ih+*.state[2]}
    TC 100.0 "ihS3" {*-ih+*.state[3]}
    TC 100.0 "ihS4" {*-ih+*.state[4]}

In this example, each TC command performs clustering on the specified set of states, each cluster is
tied and output as a macro. The macro name is generated by appending the cluster index to the
macro name given in the command. The effect of this command is illustrated in Fig. 10.2. Note
that if a word-internal triphone system is being built, it is sensible to include biphones as well as
triphones in the item list, for example, the first command above would be written as

    TC 100.0 "ihS2" {(*-ih,ih+*,*-ih+*).state[2]}

If the above TC commands are repeated for all phones, the resulting set of tied-state models will
have far fewer parameters in total than the original untied set. The numeric argument immediately
following the TC command name is the cluster threshold. Increasing this value will allow larger
and hence, fewer clusters. The aim, of course, is to strike the right balance between compactness
and the acoustic accuracy of the individual models. In practice, the use of this command requires
some experimentation to find a good threshold value. HHEd provides extensive trace output for
monitoring clustering operations. Note in this respect that as well as setting tracing from the
command line and the configuration file, tracing in HHEd can be set by the TR command. Thus,
tracing can be controlled at the command level. Further trace information can be obtained by
including the SH command at strategic points in the edit script. The effect of executing this
command is to list out all of the parameter tyings currently in force.

A potential problem with the use of the TC and NC commands is that outlier states will tend to
form their own singleton clusters for which there is then insufficient data to properly train. One
solution to this is to use the RO command to remove outliers. This command has the form

    RO thresh "statsfile"

where statsfile is the name of a statistics file output using the -s option of HERest. This
statistics file holds the occupation counts for all states of the HMM set being trained. The term
occupation count refers to the number of frames allocated to a particular state and can be used
as a measure of how much training data is available for estimating the parameters of that state.
The RO command must be executed before the TC or NC commands used to do the actual clustering.
Its effect is to simply read in the statistics information from the given file and then to set a flag
instructing the TC or NC commands to remove any outliers remaining at the conclusion of the normal
clustering process. This is done by repeatedly finding the cluster with the smallest total occupation
count and merging it with its nearest neighbour. This process is repeated until all clusters have a
total occupation count which exceeds thresh, thereby ensuring that every cluster of states will be
properly trained in the subsequent re-estimation performed by HERest.
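
The outlier removal step can be sketched in Python as follows. The function `merge_outliers`, the `farthest` distance helper and the toy positions and occupation counts are assumptions for illustration; how HHEd measures the nearest neighbour internally is not specified here.

```python
def merge_outliers(clusters, occupancy, distance, thresh):
    """Merge the least occupied cluster into its nearest neighbour until all exceed thresh."""
    clusters = [list(c) for c in clusters]

    def total(cluster):
        return sum(occupancy[s] for s in cluster)

    while len(clusters) > 1:
        smallest = min(range(len(clusters)), key=lambda i: total(clusters[i]))
        if total(clusters[smallest]) > thresh:
            break                                     # every cluster has enough data
        nearest = min(
            (i for i in range(len(clusters)) if i != smallest),
            key=lambda i: distance(clusters[smallest], clusters[i]),
        )
        clusters[nearest] = clusters[nearest] + clusters[smallest]
        del clusters[smallest]
    return clusters


position = {"s1": 0.0, "s2": 0.2, "s3": 3.0, "s4": 3.1}
occupancy = {"s1": 120.0, "s2": 90.0, "s3": 4.0, "s4": 200.0}


def farthest(c1, c2):
    """Greatest distance between a state of c1 and a state of c2 (1-dimensional toy case)."""
    return max(abs(position[a] - position[b]) for a in c1 for b in c2)


merged = merge_outliers([["s1", "s2"], ["s3"], ["s4"]], occupancy, farthest, thresh=50.0)
```

The singleton cluster holding `s3`, with only 4 frames, is absorbed by the cluster containing its nearest neighbour `s4`.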

On completion of the above clustering and tying procedures, many of the models may be
effectively identical, since acoustically similar triphones may share common clusters for all their emitting
states. They are then, in effect, so-called generalised triphones. State tying can be further exploited
if the HMMs which are effectively equivalent are identified and then tied via the physical-logical
mapping* facility provided by HMM lists (see section 7.4). The effect of this would be to reduce
the total number of HMM definitions required. HHEd provides a compaction command to do all
of this automatically. For example, the command

    CO newList

will compact the currently loaded HMM set by identifying equivalent models and then tying them
via the new HMM list output to the file newList. Note, however, that for two HMMs to be tied,
they must be identical in all respects. This is one of the reasons why transition parameters are
often tied across triphone groups otherwise HMMs with identical states would still be left distinct
due to minor differences in their transition matrices.
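
The idea behind compaction can be sketched in Python by describing each logical model as the tuple of macro names it uses and mapping identical tuples to one physical model. The representation and the function name `compact` are assumptions for this example, and the macro names shown are hypothetical.

```python
def compact(models):
    """Map each logical HMM name to the physical HMM that represents it.

    models: dict of logical name -> tuple of macro names (tied states and transition matrix).
    Two models share a physical model only if their tuples are identical.
    """
    physical_for = {}
    mapping = {}
    for name, definition in models.items():
        # the first model seen with a given definition becomes the physical model
        mapping[name] = physical_for.setdefault(definition, name)
    return mapping


models = {
    "b-ih+t": ("ihS2_1", "ihS3_1", "ihS4_2", "T_ih"),
    "p-ih+t": ("ihS2_1", "ihS3_1", "ihS4_2", "T_ih"),
    "b-ih+d": ("ihS2_1", "ihS3_1", "ihS4_3", "T_ih"),
}
mapping = compact(models)
n_physical = len(set(mapping.values()))
```

Here `p-ih+t` maps to the physical model `b-ih+t`, leaving two physical definitions for three logical names. If the transition macro differed between the first two models, they would remain distinct, which is why tying transition matrices first makes compaction more effective.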



10.5 Tree-Based Clustering 

One limitation of the data-driven clustering procedure described above is that it does not deal 
with triphones for which there are no examples in the training data. When building word-internal 
triphone systems, this problem can often be avoided by careful design of the training database but 
when building large vocabulary cross-word triphone systems unseen triphones are unavoidable. 

*The physical HMM which corresponds to several logical HMMs will be arbitrarily named after one of them.

Fig. 10.3 Decision tree-based state tying

HHEd provides an alternative decision tree based clustering mechanism which provides a similar
quality of clustering but offers a solution to the unseen triphone problem. Decision tree-based
clustering is invoked by the command TB which is analogous to the TC command described above
and has an identical form, that is

    TB thresh macroname itemlist

Apart from the clustering mechanism, there are some other differences between TC and TB. Firstly,
TC uses a distance metric between states whereas TB uses a log likelihood criterion. Thus, the
threshold values are not directly comparable. Furthermore, TC supports any type of output
distribution whereas TB only supports single-Gaussian continuous density output distributions. Secondly,
although the following describes only state clustering, the TB command can also be used to cluster
whole models.

A phonetic decision tree is a binary tree in which a yes/no phonetic question is attached to each
node. Initially all states in a given item list (typically a specific phone state position) are placed
at the root node of a tree. Depending on each answer, the pool of states is successively split and
this continues until the states have trickled down to leaf-nodes. All states in the same leaf node are
then tied. For example, Fig. 10.3 illustrates the case of tying the centre states of all triphones of the
phone /aw/ (as in "out"). All of the states trickle down the tree and depending on the answer to
the questions, they end up at one of the shaded terminal nodes. For example, in the illustrated case,
the centre state of s-aw+n would join the second leaf node from the right since its right context is
a central consonant, and its right context is a nasal but its left context is not a central stop.

The question at each node is chosen to (locally) maximise the likelihood of the training data 
given the final set of state tyings. Before any tree building can take place, all of the possible phonetic 
questions must be loaded into HHEd using QS commands. Each question takes the form "Is the 
left or right context in the set P?" where the context is the model context as defined by its logical 
name. The set P is represented by an item list and for convenience every question is given a name. 
As an example, the following command

QS "L_Nasal" { ng-*,n-*,m-* }

defines the question "Is the left context a nasal?".
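The way a question's item list selects models can be illustrated with a short Python sketch. It is not HHEd's own matcher: it assumes that shell-style wildcards, as implemented by Python's fnmatch module, are a reasonable stand-in for the item list patterns, and the function name answers_yes is hypothetical.

```python
from fnmatch import fnmatchcase

def answers_yes(logical_name, item_list):
    """Return True if the logical model name matches any pattern in the item list."""
    return any(fnmatchcase(logical_name, pattern) for pattern in item_list)

# Item list taken from the QS "L_Nasal" example in the text.
L_NASAL = ["ng-*", "n-*", "m-*"]

print(answers_yes("n-aw+t", L_NASAL))   # left context n is in the nasal set
print(answers_yes("s-aw+n", L_NASAL))   # left context s is not
```

Because the question is answered from the logical name alone, the same test can later be applied to triphones that never occurred in the training data.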

It is possible to calculate the log likelihood of the training data given any pool of states (or 
models). Furthermore, this can be done without reference to the training data itself since for 
single Gaussian distributions the means, variances and state occupation counts (input via a stats 
file) form sufficient statistics. Splitting any pool into two will increase the log likelihood since it 
provides twice as many parameters to model the same amount of data. The increase obtained when 
each possible question is used can thus be calculated and the question selected which gives the
biggest improvement.
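The calculation can be sketched as follows. This is an illustration rather than the HHEd implementation: it assumes diagonal covariances, represents each state as a hypothetical (occupation, means, variances) triple, and uses the standard result that the log likelihood of data under its own maximum likelihood Gaussian depends only on the pooled variance and the total occupation.

```python
import math

def pool_stats(states):
    """Pool (occ, mean, var) triples into the statistics of one shared Gaussian."""
    total = sum(occ for occ, _, _ in states)
    dim = len(states[0][1])
    mean = [sum(occ * m[d] for occ, m, _ in states) / total for d in range(dim)]
    var = [sum(occ * (v[d] + m[d] ** 2) for occ, m, v in states) / total - mean[d] ** 2
           for d in range(dim)]
    return total, mean, var

def pool_log_likelihood(states):
    """Log likelihood of the training data if all states in the pool share one Gaussian."""
    total, _, var = pool_stats(states)
    dim = len(var)
    log_det = sum(math.log(v) for v in var)
    return -0.5 * total * (dim * (1.0 + math.log(2.0 * math.pi)) + log_det)

def split_gain(yes_states, no_states):
    """Increase in log likelihood obtained by splitting a pool into two parts."""
    return (pool_log_likelihood(yes_states) + pool_log_likelihood(no_states)
            - pool_log_likelihood(yes_states + no_states))

a = [(10.0, [0.0], [1.0])]
b = [(10.0, [4.0], [1.0])]
print(split_gain(a, b))   # positive: the two states have well separated means
```

Evaluating split_gain for every loaded question and keeping the largest value corresponds to the question selection described above.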

Trees are therefore built using a top-down sequential optimisation process. Initially all states
(or models) are placed in a single cluster at the root of the tree. The question is then found which
gives the best split of the root node. This process is repeated until the increase in log likelihood falls
below the threshold specified in the TB command. As a final stage, the decrease in log likelihood
is calculated for merging terminal nodes with differing parents. Any pair of nodes for which this
decrease is less than the threshold used to stop splitting is then merged.

As with the TC command, it is useful to prevent the creation of clusters with very little associated
training data. The RO command can therefore be used in tree clustering as well as in data-driven
clustering. When used with trees, any split which would result in a total occupation count falling
below the value specified is prohibited. Note that the RO command can also be used to load the
required stats file. Alternatively, the stats file can be loaded using the LS command.

As with data-driven clustering, using the trace facilities provided by HHEd is recommended for
monitoring and setting the appropriate thresholds. Basic tracing provides the following summary
data for each tree

TB 350.00 aw_s3 {}
 Tree based clustering
  Start  aw[3]   28  have LogL=-86.899 occ=864.2
  Via    aw[3]    5  gives LogL=-84.421 occ=864.2
  End    aw[3]    5  gives LogL=-84.421 occ=864.2
 TB: Stats 28->5 [17.9%]  { 4537->285 [6.3%] total }

This example corresponds to the case illustrated in Fig. 10.3. The TB command has been invoked
with a threshold of 350.0 to cluster the centre states of the triphones of the phone aw. At the start
of clustering with all 28 states in a single pool, the average log likelihood per unit of occupation is
-86.9 and on completion with 5 clusters this has increased to -84.4. The middle line labelled "Via"
gives the position after the tree has been built but before terminal nodes have been merged (none
were merged in this case). The last line summarises the overall position. After building this tree,
a total of 4537 states were reduced to 285 clusters.

As noted at the start of this section, an important advantage of tree-based clustering is that it
allows triphone models which have no training data to be synthesised. This is done in HHEd using
the AU command which has the form

AU hmmlist

Its effect is to scan the given hmmlist and any physical models listed which are not in the currently
loaded set are synthesised. This is done by descending the previously constructed trees for that
phone and answering the questions at each node based on the new unseen context. When each leaf
node is reached, the state representing that cluster is used for the corresponding state in the unseen
triphone.
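The descent can be pictured with a small Python sketch. The tree, its questions and the macro names (aw_s3_1 and so on) are invented for illustration, and questions are again matched with shell-style wildcards as an assumption; only the walk from root to leaf reflects the procedure described above.

```python
from fnmatch import fnmatchcase

# Hypothetical tree for the centre state of aw. Internal nodes hold a question
# (a list of name patterns); leaves hold the name of the tied state to reuse.
TREE = {
    "question": ["*+n", "*+m", "*+ng"],        # is the right context a nasal?
    "yes": {"leaf": "aw_s3_1"},
    "no": {
        "question": ["p-*", "t-*", "k-*"],     # hypothetical left-context question
        "yes": {"leaf": "aw_s3_2"},
        "no": {"leaf": "aw_s3_3"},
    },
}

def find_tied_state(logical_name, node):
    """Descend the tree, answering each question from the triphone name alone."""
    while "leaf" not in node:
        matched = any(fnmatchcase(logical_name, p) for p in node["question"])
        node = node["yes"] if matched else node["no"]
    return node["leaf"]

print(find_tied_state("z-aw+d", TREE))   # an unseen triphone still reaches a leaf
```

Since every path ends at a leaf, any context, seen or unseen, is assigned a state for which training data existed.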

The AU command can be used within the same edit script as the tree building commands.
However, it will often be the case that a new set of triphones is needed at a later date, perhaps as
a result of vocabulary changes. To make this possible, a complete set of trees can be saved using
the ST command and then later reloaded using the LT command.



10.6 Mixture Incrementing 

When building sub-word based continuous density systems, the final system will typically consist 
of multiple mixture component context-dependent HMMs. However, as indicated previously, the 
early stages of triphone construction, particularly state tying, are best done with single Gaussian 
models. Indeed, if tree-based clustering is to be used there is no option.

In HTK therefore, the conversion from single Gaussian HMMs to multiple mixture component
HMMs is usually one of the final steps in building a system. The mechanism provided to do 
this is the HHEd MU command which will increase the number of components in a mixture by a 
process called mixture splitting. This approach to building a multiple mixture component system 
is extremely flexible since it allows the number of mixture components to be repeatedly increased 
until the desired level of performance is achieved. 

The MU command has the form 

MU n itemList 

where n gives the new number of mixture components required and itemList defines the actual
mixture distributions to modify. This command works by repeatedly splitting the 'heaviest' mixture
component until the required number of components is obtained. The 'heaviness' score of a mixture
component is defined as the mixture weight minus the number of splits involving that component
that have already been carried out by the current MU command. Subtracting the number of
splits discourages repeated splitting of the same mixture component. If the GCONST value (see 7.10)
of a component is more than four standard deviations smaller than the average gConst, a further
adjustment is made to the 'heaviness' score of the component in order to make it very unlikely that
the component will be selected for splitting. The actual split is performed by copying the mixture
component, dividing the weights of both copies by 2, and finally perturbing the means by plus or
minus 0.2 standard deviations. For example, the command

MU 3 {aa.state[2].mix}

would increase the number of mixture components in the output distribution for state 2 of model aa
to 3. Normally, however, the number of components in all mixture distributions will be increased
at the same time. Hence, a command of the following form is more usual

MU 3 {*.state[2-4].mix}

It is usually a good idea to increment mixture components in stages, for example, by incrementing
by 1 or 2 then re-estimating, then incrementing by 1 or 2 again and re-estimating, and so on until
the required number of components is obtained. This also allows recognition performance to be
monitored to find the optimum.
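The splitting rule described for MU can be sketched in a few lines of Python. This is a simplified illustration, not the HHEd code: it assumes diagonal variances, stores each component as a hypothetical dictionary, and leaves out the GCONST-based adjustment to the heaviness score.

```python
import math

def mix_up(components, target, perturb=0.2):
    """Split the heaviest component repeatedly until `target` components exist.

    components: list of dicts with keys 'weight', 'mean' and 'var' (diagonal).
    """
    comps = [dict(c, splits=0) for c in components]   # work on copies
    while len(comps) < target:
        # Heaviness is the weight minus the number of splits already involving the component.
        heavy = max(comps, key=lambda c: c["weight"] - c["splits"])
        heavy["splits"] += 1
        heavy["weight"] /= 2.0
        offset = [perturb * math.sqrt(v) for v in heavy["var"]]
        # The copy moves down by 0.2 standard deviations, the original moves up.
        twin = dict(heavy, mean=[m - o for m, o in zip(heavy["mean"], offset)])
        heavy["mean"] = [m + o for m, o in zip(heavy["mean"], offset)]
        comps.append(twin)
    return [{k: c[k] for k in ("weight", "mean", "var")} for c in comps]

start = [{"weight": 1.0, "mean": [0.0], "var": [4.0]}]
for comp in mix_up(start, 3):
    print(comp)
```

Halving the weight of both copies keeps the mixture weights summing to one, so the split distribution is a valid starting point for re-estimation.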

One final point with regard to multiple mixture component distributions is that all HTK tools
ignore mixture components whose weights fall below a threshold value called MINMIX (defined in
HModel.h). Such mixture components are called defunct. Defunct mixture components can be
prevented by setting the -w option in HERest so that all mixture weights are floored to some
level above MINMIX. If mixture weights are allowed to fall below MINMIX then the corresponding
Gaussian parameters will not be written out when the model containing that component is saved.
It is possible to recover from this, however, since the MU command will replace defunct mixtures
before performing any requested mixture component increment.

10.7 Regression Class Tree Construction

a binary regression class tree. This tree is stored in the MMF, along with a regression base class
identifier for each mixture component. An example regression tree and how it may be used is shown
in subsection 9.1.4. HHEd provides the means to construct a regression class tree for a given MMF,
and is invoked using the RC command. It is also necessary to supply a statistics file, which is output
using the -s option of HERest. The statistics file can be loaded by invoking the LS command.

A centroid-splitting algorithm using a Euclidean distance measure is used to grow the binary 
regression class tree to cluster the model set's mixture components. Each leaf node therefore 
specifies a particular mixture component cluster. This algorithm proceeds as follows until the 
requested number of terminals has been achieved. 

• Select a terminal node that is to be split. 

• Calculate the mean and variance from the mixture components clustered at this node.

• Create two children. Initialise their means to the parent mean perturbed in opposite directions
(for each child) by a fraction of the variance.

• For each component at the parent node assign the component to one of the children by using
a Euclidean distance measure to ascertain which child mean the component is closest to.

• Once all the components have been assigned, calculate the new means for the children, based 
on the component assignments. 

• Keep re-assigning components to the children and re-estimating the child means until there 
is no change in assignments from one iteration to the next. Now finalise the split. 
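The steps listed above for splitting one node can be sketched in Python as follows. The sketch is illustrative only: it treats each mixture component as a plain mean vector, ignores occupation counts, and uses a hypothetical perturbation fraction of 0.2, none of which is specified in the text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split_node(points, fraction=0.2, max_iter=100):
    """Split the component means at one node into two child clusters."""
    dim, n = len(points[0]), len(points)
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n for d in range(dim)]
    # Children start at the parent mean, perturbed in opposite directions.
    children = [[m + fraction * v for m, v in zip(mean, var)],
                [m - fraction * v for m, v in zip(mean, var)]]
    assign = None
    for _ in range(max_iter):
        new_assign = [min((0, 1), key=lambda c: sq_dist(p, children[c])) for p in points]
        if new_assign == assign:          # no change in assignments: finalise the split
            break
        assign = new_assign
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                children[c] = [sum(p[d] for p in members) / len(members) for d in range(dim)]
    return assign, children

assign, children = split_node([[0.0], [0.2], [10.0], [10.2]])
print(assign, children)
```

Growing the full tree amounts to applying this split repeatedly to a chosen terminal node until the requested number of terminals is reached.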

As an example, the following HHEd script would produce a regression class tree with 32 terminal
nodes, or regression classes:

LS "statsfile"
RC 32 "rtree"

A further optional argument is possible with the RC command. This argument allows the user
to specify the non-speech class mixture components using an itemlist, such as the silence mixture
components.

LS "statsfile"
RC 32 "rtree" {sil.state[2-4].mix}

In this case the first split that will be made in the regression class tree will be to split the speech
and non-speech sounds, after which the tree building continues as usual.

10.8 Miscellaneous Operations

The preceding sections have described the main HHEd commands used for building continuous
density systems with tied parameters. A further group of commands (JO, TI and HK) are used to
build tied-mixture systems and these are described in Chapter 11. Those remaining cover a miscellany
of functions. They are documented in the reference entry for HHEd and include commands
to add and remove state transitions (AT, RT); synthesise triphones from biphones (MT); change the
parameter kind of a HMM (SK); modify stream dimensions (SS, SU, SW); change/add an identifier
name to an MMF (RN command); and expand HMM sets by duplication, for example, as needed in
making gender dependent models (DP).






Chapter 11

Discrete and Tied-Mixture Models

[Figure: chapter overview highlighting the tools HQuant, HInit, HRest, HERest and HSmooth]

Most of the discussion so far has focussed on using HTK to model sequences of continuous-
valued vectors. In contrast, this chapter is mainly concerned with using HTK to model sequences
of discrete symbols. Discrete symbols arise naturally in modelling many types of data, for example,
letters and words, bitmap images, and DNA sequences. Continuous signals can also be converted
to discrete symbol sequences by using a quantiser and in particular, speech vectors can be vector
quantised as described in section 5.14. In all cases, HTK expects a set of N discrete symbols to be
represented by the contiguous sequence of integer numbers from 1 to N.

In HTK discrete probabilities are regarded as being closely analogous to the mixture weights of a
continuous density system. As a consequence, the representation and processing of discrete HMMs
shares a great deal with continuous density models. It follows from this that most of the principles
and practice developed already are equally applicable to discrete systems. As a consequence, this
chapter can be quite brief.

The first topic covered concerns building HMMs for discrete symbol sequences. The use of
discrete HMMs with speech is then presented. The tool HQuant is described and the method of
converting continuous speech vectors to discrete symbols is reviewed. This is followed by a brief
discussion of tied-mixture systems which can be regarded as a compromise between continuous and
discrete density systems. Finally, the use of the HTK tool HSmooth for parameter smoothing by
deleted interpolation is presented.



11.1 Modelling Discrete Sequences 

Building HMMs for discrete symbol sequences is essentially the same as described previously for 
continuous density systems. Firstly, a prototype HMM definition must be specified in order to fix 
the model topology. For example, the following is a 3 state ergodic HMM in which the emitting 
states are fully connected. 

~o <DISCRETE> <StreamInfo> 1 1
~h "dproto"
<BeginHMM>
  <NumStates> 5
  <State> 2 <NumMixes> 10
    <DProb> 5461*10
  <State> 3 <NumMixes> 10
    <DProb> 5461*10
  <State> 4 <NumMixes> 10
    <DProb> 5461*10
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.3 0.3 0.3 0.1
    0.0 0.3 0.3 0.3 0.1
    0.0 0.3 0.3 0.3 0.1
    0.0 0.0 0.0 0.0 0.0
<EndHMM>

As described in chapter 7, the notation for discrete HMMs borrows heavily on that used for continuous
density models by equating mixture components with symbol indices. Thus, this definition
assumes that each training data sequence contains a single stream of symbols indexed from 1 to 10.
In this example, all symbols in each state have been set to be equally likely^1. If prior information
is available then this can of course be used to set these initial values.
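The value 5461 used for each symbol in the prototype can be checked with a short Python sketch. It relies on the footnote to this chapter, which states that a stored value of 32767 corresponds to a probability of 0.000001 and 0 to a probability of 1.0; that the mapping between these two points is logarithmic is an assumption made here, as are the function names.

```python
import math

# Scale factor derived from the two fixed points given in the footnote
# (0 <-> probability 1.0, 32767 <-> probability 0.000001); roughly 2371.8.
SCALE = 32767 / -math.log(1e-6)

def prob_to_dprob(p):
    """Convert a probability to the scaled integer form used in <DProb> entries."""
    return min(32767, round(-SCALE * math.log(p)))

def dprob_to_prob(x):
    """Convert a scaled <DProb> value back to a probability."""
    return math.exp(-x / SCALE)

print(prob_to_dprob(0.1))    # ten equally likely symbols
print(dprob_to_prob(5461))
```

Under this assumption a uniform distribution over ten symbols gives 5461 for each entry, which agrees with the prototype above.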

The training data needed to build a discrete HMM can take one of two forms. It can either be
discrete (SOURCEKIND=DISCRETE) in which case it consists of a sequence of 2-byte integer symbol
indices. Alternatively, it can consist of continuous parameter vectors with an associated VQ codebook.
This latter case is dealt with in the next section. Here it will be assumed that the data is
symbolic and that it is therefore stored in discrete form. Given a set of training files listed in the
script file train.scp, an initial HMM could be estimated using

HInit -T 1 -w 1.0 -o dhmm -S train.scp -M hmm0 dproto

This use of HInit is identical to that which would be used for building whole word HMMs where
no associated label file is assumed and the whole of each training sequence is used to estimate the
HMM parameters. Its effect is to read in the prototype stored in the file dproto and then use the
training examples to estimate initial values for the output distributions and transition probabilities.
This is done by firstly uniformly segmenting the data and for each segment counting the number
of occurrences of each symbol. These counts are then normalised to provide output distributions
for each state. HInit then uses the Viterbi algorithm to resegment the data and recompute the
parameters. This is repeated until convergence is achieved or an upper limit on the iteration count
is reached. The transition probabilities at each step are estimated simply by counting the number
of times that each transition is made in the Viterbi alignments and normalising. The final model is
renamed dhmm and stored in the directory hmm0.
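The first step, uniform segmentation followed by counting and normalising, can be sketched in Python. This is a simplified illustration of the idea rather than HInit itself: the function name is hypothetical, the subsequent Viterbi resegmentation is omitted, and no probability flooring is applied.

```python
def uniform_init(sequences, num_states, num_symbols):
    """Estimate initial output distributions by uniform segmentation.

    sequences: lists of symbol indices numbered from 1 to num_symbols.
    Returns one probability list per emitting state.
    """
    counts = [[0] * num_symbols for _ in range(num_states)]
    for seq in sequences:
        for t, sym in enumerate(seq):
            # Divide each sequence into num_states equal-length segments.
            state = min(num_states - 1, t * num_states // len(seq))
            counts[state][sym - 1] += 1
    dists = []
    for row in counts:
        total = sum(row)
        # Fall back to a uniform distribution if a state saw no data.
        dists.append([c / total if total else 1.0 / num_symbols for c in row])
    return dists

print(uniform_init([[1, 1, 2, 2, 3, 3]], num_states=3, num_symbols=3))
```

In this toy case any symbol not seen in a segment receives a zero probability, which is why flooring matters when real discrete models are built.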

When building discrete HMMs, it is important to floor the discrete probabilities so that no
symbol has a zero probability. This is achieved using the -w option which specifies a floor value as
a multiple of a global constant called MINMIX whose value is 10^-5.

The initialised HMM created by HInit can then be further refined if desired by using HRest
to perform Baum-Welch re-estimation. It would be invoked in a similar way to the above except
that there is now no need to rename the model. For example,

HRest -T 1 -w 1.0 -S train.scp -M hmm1 hmm0/dhmm

would read in the model stored in hmm0/dhmm and write out a new model of the same name to the
directory hmm1.



11.2 Using Discrete Models with Speech 

As noted in section 5.14, discrete HMMs can be used to model speech by using a vector quantiser
to map continuous density vectors into discrete symbols. A vector quantiser depends on a so-called
codebook which defines a set of partitions of the vector space. Each partition is represented by the
mean value of the speech vectors belonging to that partition and optionally a variance representing
the spread. Each incoming speech vector is then matched with each partition and assigned the
index corresponding to the partition which is closest using a Mahalanobis distance metric.

^1 Remember that discrete probabilities are scaled such that 32767 is equivalent to a probability of 0.000001 and
0 is equivalent to a probability of 1.0.

In HTK such a codebook can be built using the tool HQuant. This tool takes as input a set of
continuous speech vectors, clusters them and uses the centroid and optionally the variance of each
cluster to define the partitions. HQuant can build both linear and tree structured codebooks.
To build a linear codebook, all training vectors are initially placed in one cluster and the mean
calculated. The mean is then perturbed to give two means and the training vectors are partitioned
according to which mean is nearest to them. The means are then recalculated and the data is
repartitioned. At each cycle, the total distortion (i.e. total distance between the cluster members
and the mean) is recorded and repartitioning continues until there is no significant reduction in
distortion. The whole process then repeats by perturbing the mean of the cluster with the highest
distortion. This continues until the required number of clusters have been found.

Since all training vectors are reallocated at every cycle, this is an expensive algorithm to compute.
The maximum number of iterations within any single cluster increment can be limited using the
configuration variable MAXCLUSTITER and although this can speed-up the computation significantly,
the overall training process is still computationally expensive. Once built, vector quantisation is
performed by scanning all codebook entries and finding the nearest entry. Thus, if a large codebook
is used, the run-time VQ look-up operation can also be expensive.
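The linear look-up described above amounts to an exhaustive nearest-neighbour search. A minimal Python sketch, assuming Euclidean distance and a codebook held as a plain list of mean vectors (the function name is hypothetical), is:

```python
def quantise(vector, codebook):
    """Return the 1-based index of the codebook mean nearest to the vector."""
    best = min(range(len(codebook)),
               key=lambda k: sum((x - m) ** 2 for x, m in zip(vector, codebook[k])))
    return best + 1   # HTK numbers discrete symbols from 1

codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
print(quantise([0.9, 1.2], codebook))
```

Every entry is visited for every input vector, so the cost grows linearly with the codebook size, which is the expense referred to in the text.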

As an alternative to building a linear codebook, a tree-structured codebook can be used. The
algorithm for this is essentially the same as above except that every cluster is split at each stage so
that the first cluster is split into two, these are split into four and so on. At each stage, the means
are recorded so that when using the codebook for vector quantising a fast binary search can be
used to find the appropriate leaf cluster. Tree-structured codebooks are much faster to build since
there is no repeated reallocation of vectors and much faster in use since only O(log2 N) distances
need to be computed where N is the size of the codebook. Unfortunately, however, tree-structured
codebooks will normally incur higher VQ distortion for a given codebook size.

When delta and acceleration coefficients are used, it is usually best to split the data into multiple
streams (see section 5.13). In this case, a separate codebook is built for each stream.

As an example, the following invocation of HQuant would generate a linear codebook in the
file linvq using the data stored in the files listed in vq.scp.

HQuant -C config -s 4 -n 3 64 -n 4 16 -S vq.scp linvq

Here the configuration file config specifies the TARGETKIND as being MFCC_E_D_A i.e. static coefficients
plus deltas plus accelerations plus energy. The -s option requests that this parameterisation
be split into 4 separate streams. By default, each individual codebook has 256 entries, however,
the -n option can be used to specify alternative sizes.

If a tree-structured codebook was wanted rather than a linear codebook, the -t option would be
set. Also the default is to use Euclidean distances both for building the codebook and for subsequent
coding. Setting the -d option causes a diagonal covariance Mahalanobis metric to be used and the
-f option causes a full covariance Mahalanobis metric to be used.

Once the codebook is built, normal speech vector files can be converted to discrete files using
HCopy. This was explained previously in section 5.14. The basic mechanism is to add the qualifier
_V to the TARGETKIND. This causes HParm to append a codebook index to each constructed observation
vector. If the configuration variable SAVEASVQ is set true, then the output routines in HParm
will discard the original vectors and just save the VQ indices in a DISCRETE file. Alternatively, HTK
will regard any speech vector with _V set as being compatible with discrete HMMs. Thus, it is not
necessary to explicitly create a database of discrete training files if a set of continuous speech vector
parameter files already exists. Fig. 11.1 illustrates this process.

Once the training data has been configured for discrete HMMs, the rest of the training process
is similar to that previously described. The normal sequence is to build a set of monophone models
and then clone them to make triphones. As in continuous density systems, state tying can be used
to improve the robustness of the parameter estimates. However, in the case of discrete HMMs,
alternative methods based on interpolation are possible. These are discussed in section 11.4.

11.3 Tied Mixture Systems

Discrete systems have the advantage of low run-time computation. However, vector quantisation
reduces accuracy and this can lead to poor performance. As an intermediate between discrete and
continuous, a fully tied-mixture system can be used. Tied-mixtures are conceptually just another
example of the general parameter tying mechanism built-in to HTK. However, to use them effectively
in speech recognition systems a number of storage and computational optimisations must be made.
Hence, they are given special treatment in HTK.

When specific mixtures are tied as in

TI "mix" {*.state[2].mix[1]}

then a Gaussian mixture component is shared across all of the owners of the tie. In this example,
all models will share the same Gaussian for the first mixture component of state 2. However, if the
mixture component index is missing, then all of the mixture components participating in the tie
are joined rather than tied. More specifically, the commands

JO 128 2.0
TI "mix" {*.state[2-4].mix}

have the following effect. All of the mixture components in states 2 to 4 of all models are collected into
a pool. If the number of components in the pool exceeds 128, as set by the preceding join command
JO, then components with the smallest weights are removed until the pool size is exactly 128.
Similarly, if the size of the initial pool is less than 128, then mixture components are split using the
same algorithm as for the Mix-Up MU command. All states then share all of the mixture components
in this pool. The new mixture weights are chosen to be proportional to the log probability of the
corresponding new mixture component mean with respect to the original distribution for that state.
The log is used here to give a wider spread of mixture weights. All mixture weights are floored to
the value of the second argument of the JO command times MINMIX.

The net effect of the above two commands is to create a set of tied-mixture HMMs^2 where
the same set of mixture components is shared across all states of all models. However, the type of
the HMM set so created will still be SHARED and the internal representation will be the same as
for any other set of parameter tyings. To obtain the optimised representation of the tied-mixture
weights described in section 7.5, the following HHEd HK command must be issued

HK TIEDHS

This will convert the internal representation to the special tied-mixture form in which all of the tied
mixtures are stored in a global table and referenced implicitly instead of being referenced explicitly
using pointers.

Tied-mixture HMMs work best if the information relating to different sources such as delta
coefficients and energy are separated into distinct data streams. This can be done by setting up
multiple data stream HMMs from the outset. However, it is simpler to use the SS command in
HHEd to split the data streams of the currently loaded HMM set. Thus, for example, the command

SS 4

would convert the currently loaded HMMs to use four separate data streams rather than one. When
used in the construction of tied-mixture HMMs, this is analogous to the use of multiple codebooks
in discrete density HMMs.

The procedure for building a set of tied-mixture HMMs may be summarised as follows

1. Choose a codebook size for each data stream and then decide how many Gaussian components
will be needed from an initial set of monophones to approximately fill this codebook. For
example, suppose that there are 48 three state monophones. If codebook sizes of 128 are
chosen for streams 1 and 2, and a codebook size of 64 is chosen for stream 3 then single
Gaussian monophones would provide enough mixtures in total to fill the codebooks.

2. Train the initial set of monophones.

3. Use HHEd to first split the HMMs into the required number of data streams, tie each individual
stream and then convert the tied-mixture HMM set to have the kind TIEDHS. A typical
script to do this for four streams would be

SS 4
JO 256 2.0
TI st1 {*.state[2-4].stream[1].mix}
JO 128 2.0
TI st2 {*.state[2-4].stream[2].mix}
JO 128 2.0
TI st3 {*.state[2-4].stream[3].mix}
JO 64 2.0
TI st4 {*.state[2-4].stream[4].mix}
HK TIEDHS

4. Re-estimate the models using HERest in the normal way. 

Once the set of retrained tied-mixture models has been produced, context dependent models can
be constructed using similar methods to those outlined previously.

When evaluating probabilities in tied-mixture systems, it is often sufficient to sum just the most
likely mixture components since for any particular input vector, its probability with respect to
many of the Gaussian components will be very low. HTK tools recognise TIEDHS HMM sets as
being special in the sense that additional optimisations are possible. When full tied-mixtures are
used, then an additional layer of pruning is applied. At each time frame, the log probability of the
current observation is computed for each mixture component. Then only those components which
lie within a threshold of the most likely component are retained. This pruning is controlled by the
-c option in HRest, HERest and HVite.

^2 Also called semi-continuous HMMs in the literature.
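The pruning rule can be stated compactly in Python. This is only a sketch of the idea: the function name is hypothetical, and the exact way in which the HTK tools interpret the -c threshold is not specified here.

```python
def prune_components(log_probs, threshold):
    """Keep the indices of components within `threshold` of the most likely one."""
    best = max(log_probs)
    return [i for i, lp in enumerate(log_probs) if lp >= best - threshold]

# Log probabilities of the current observation for four shared Gaussians.
print(prune_components([-10.0, -12.0, -30.0, -11.5], threshold=5.0))
```

Only the retained components then need to be weighted and summed for each state, which is where the saving arises when the pool of shared Gaussians is large.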



11.4 Parameter Smoothing 



When large sets of context-dependent triphones are built using discrete models or tied-mixture
models, under-training can be a severe problem since each state has a large number of mixture
weight parameters to estimate. The HTK tool HSmooth allows these discrete probabilities or
mixture component weights to be smoothed with the monophone weights using a technique called
deleted interpolation.

HSmooth is used in combination with HERest working in parallel mode. The training data
is split into blocks and each block is used separately to re-estimate the HMMs. However, since
HERest is in parallel mode, it outputs a dump file of accumulators instead of updating the models.
HSmooth is then used in place of the second pass of HERest. It reads in the accumulator
information from each of the blocks, performs deleted interpolation smoothing on the accumulator
values and then outputs the re-estimated HMMs in the normal way.

HSmooth implements a conventional deleted interpolation scheme. However, optimisation of
the smoothing weights uses a fast binary chop scheme rather than the more usual Baum-Welch
approach. The algorithm for finding the optimal interpolation weights for a given state and stream
is as follows where the description is given in terms of tied-mixture weights but the same applies to
discrete probabilities.

Assume that HERest has been set up to output N separate blocks of accumulators. Let $w_i^{(n)}$
be the i'th mixture weight based on the accumulator blocks 1 to N but excluding block n, and
let $\bar{w}_i^{(n)}$ be the corresponding context independent weight. Let $x_i^{(n)}$ be the i'th mixture weight
count for the deleted block n. The derivative of the log likelihood of the deleted block, given the
probability distribution with weights $c_i = \lambda w_i + (1-\lambda)\bar{w}_i$, is given by

$$D(\lambda) = \sum_{n=1}^{N} \sum_{i} x_i^{(n)} \left[ \frac{w_i^{(n)} - \bar{w}_i^{(n)}}{\lambda w_i^{(n)} + (1-\lambda)\bar{w}_i^{(n)}} \right] \qquad (11.1)$$

Since the log likelihood is a concave function of $\lambda$, this derivative allows the optimal value of $\lambda$ to
be found by a simple binary chop algorithm, viz.



function FindLambdaOpt:
   if (D(0) <= 0) return 0;
   if (D(1) >= 0) return 1;
   l=0; r=1;
   for (k=1; k<=maxStep; k++){
      m = (l+r)/2;
      if (D(m) == 0) return m;
      if (D(m) > 0) l=m; else r=m;
   }
   return m;
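The binary chop can be tried out with the following Python sketch. It is a transcription of the procedure for illustration, not the HSmooth source: the derivative is computed from hypothetical per-block triples of counts, context dependent weights and context independent weights, and the default of 16 steps is an arbitrary choice.

```python
def make_derivative(blocks):
    """Build D(lambda) from a list of (x, w, w_bar) triples, one per deleted block."""
    def D(lam):
        total = 0.0
        for x, w, w_bar in blocks:
            for xi, wi, wbi in zip(x, w, w_bar):
                total += xi * (wi - wbi) / (lam * wi + (1.0 - lam) * wbi)
        return total
    return D

def find_lambda_opt(D, max_step=16):
    """Locate the interpolation weight at which the derivative D changes sign."""
    if D(0.0) <= 0.0:
        return 0.0            # the context independent weights alone are best
    if D(1.0) >= 0.0:
        return 1.0            # the context dependent weights alone are best
    lo, hi, mid = 0.0, 1.0, 0.5
    for _ in range(max_step):
        mid = (lo + hi) / 2.0
        d = D(mid)
        if d == 0.0:
            return mid
        if d > 0.0:
            lo = mid
        else:
            hi = mid
    return mid

# One deleted block whose counts lie halfway between the two sets of weights.
D = make_derivative([([0.7, 0.3], [0.9, 0.1], [0.5, 0.5])])
print(find_lambda_opt(D))
```

Each step halves the interval containing the optimum, so even a small number of iterations gives a usable interpolation weight.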



HSmooth is invoked in a similar way to HERest. For example, suppose that the directory
hmm2 contains a set of accumulator files output by the first pass of HERest running in parallel
mode using as source the HMM definitions listed in hlist and stored in hmm1/HMMDefs. Then the
command

HSmooth -c 4 -w 2.0 -H hmm1/HMMDefs -M hmm2 hlist hmm2/*.acc

would generate a new smoothed HMM set in hmm2. Here the -w option is used to set the minimum
mixture component weight in any state to twice the value of MINMIX. The -c option sets the
maximum number of iterations of the binary chop procedure to be 4.


The preceding chapters have described how to process speech data and how to train various
types of HMM. This and the following chapter are concerned with building a speech recogniser
using HTK. This chapter focuses on the use of networks and dictionaries. A network describes
the sequence of words that can be recognised and, for the case of sub-word systems, a dictionary
describes the sequence of HMMs that constitute each word. A word level network will typically
represent either a Task Grammar which defines all of the legal word sequences explicitly or a Word
Loop which simply puts all words of the vocabulary in a loop and therefore allows any word to
follow any other word. Word-loop networks are often augmented by a stochastic language model.
Networks can also be used to define phone recognisers and various types of word-spotting systems.

Networks are specified using the HTK Standard Lattice Format (SLF) which is described in
detail in Chapter 20. This is a general purpose text format which is used for representing multiple
hypotheses in a recogniser output as well as word networks. Since SLF format is text-based, it can
be written directly using any text editor. However, this can be rather tedious and HTK provides
two tools which allow the application designer to use a higher-level representation. Firstly, the
tool HParse allows networks to be generated from a source text containing extended BNF format
grammar rules. This format was the only grammar definition language provided in earlier versions
of HTK and hence HParse also provides backwards compatibility.

HParse task grammars are very easy to write, but they do not allow fine control over the actual 
network used by the recogniser. The tool HBuild works directly at the SLF level to provide this 
detailed control. Its main function is to enable a large word network to be decomposed into a set 
of small self-contained sub-networks using as input an extended SLF format. This enhances the 
design process and avoids the need for unnecessary repetition. 

HBuild can also be used to perform a number of special-purpose functions. Firstly, it can
construct word-loop and word-pair grammars automatically. Secondly, it can incorporate a
statistical bigram language model into a network. These can be generated from label transcriptions
using HLStats. However, HTK supports the standard ARPA MIT-LL text format for backed-off
N-gram language models, and hence, import from other sources is possible.

Whichever tool is used to generate a word network, it is important to ensure that the generated 
network represents the intended grammar. It is also helpful to have some measure of the difficulty 
of the recognition task. To assist with this, the tool HSGen is provided. This tool will generate 
example word sequences from an SLF network using random sampling. It will also estimate the 
perplexity of the network.

When a word network is loaded into a recogniser, a dictionary is consulted to convert each word
in the network into a sequence of phone HMMs. The dictionary can have multiple pronunciations in
which case several sequences may be joined in parallel to make a word. Options exist in this process
to automatically convert the dictionary entries to context-dependent triphone models, either within
a word or cross-word. Pronouncing dictionaries are a vital resource in building speech recognition
systems and, in practice, word pronunciations can be derived from many different sources. The
HTK tool HDMan enables a dictionary to be constructed automatically from different sources.
Each source can be individually edited and translated and merged to form a uniform HTK format
dictionary.

The various facilities for describing a word network and expanding into a HMM level network
suitable for building a recogniser are implemented by the HTK library module HNet. The facilities
for loading and manipulating dictionaries are implemented by the HTK library module HDict and
for loading and manipulating language models are implemented by HLM. These facilities and those
provided by HParse, HBuild, HSGen, HLStats and HDMan are the subject of this chapter.

12.1 How Networks are Used

Before delving into the details of word networks and dictionaries, it will be helpful to understand
their role in building a speech recogniser using HTK. Fig 12.1 illustrates the overall recognition
process. A word network is defined using HTK Standard Lattice Format (SLF). An SLF word
network is just a text file and it can be written directly with a text editor or a tool can be used to
build it. HTK provides two such tools, HBuild and HParse. These both take as input a textual
description and output an SLF file. Whatever method is chosen, word network SLF generation is
done off-line and is part of the system build process.

An SLF file contains a list of nodes representing words and a list of arcs representing the
transitions between words. The transitions can have probabilities attached to them and these can be used
to indicate preferences in a grammar network. They can also be used to represent bigram
probabilities in a back-off bigram network and HBuild can generate such a bigram network automatically.
In addition to an SLF file, a HTK recogniser requires a dictionary to supply pronunciations for each
word in the network and a set of acoustic HMM phone models. Dictionaries are input via the HTK
interface module HDict.

The dictionary, HMM set and word network are input to the HTK library module HNet whose
function is to generate an equivalent network of HMMs. Each word in the dictionary may have
several pronunciations and in this case there will be one branch in the network corresponding to
each alternative pronunciation. Each pronunciation may consist either of a list of phones or a list
of HMM names. In the former case, HNet can optionally expand the HMM network to use either
word internal triphones or cross-word triphones. Once the HMM network has been constructed, it
can be input to the decoder module HRec and used to recognise speech input. Note that HMM
network construction is performed on-line at recognition time as part of the initialisation process.

For convenience, HTK provides a recognition tool called HVite to allow the functions provided
by HNet and HRec to be invoked from the command line. HVite is particularly useful for
running experimental evaluations on test speech stored in disk files and for basic testing using live
audio input. However, application developers should note that HVite is just a shell containing
calls to load the word network, dictionary and models; generate the recognition network and then
repeatedly recognise each input utterance. For embedded applications, it may well be appropriate
to dispense with HVite and call the functions in HNet and HRec directly from the application.
The use of HVite is explained in the next chapter.

12.2 Word Networks and Standard Lattice Format 

This section provides a basic introduction to the HTK Standard Lattice Format (SLF). SLF files are
used for a variety of functions some of which lie beyond the scope of the standard HTK package. The
description here is limited to those features of SLF which are required to describe word networks
suitable for input to HNet. The following Chapter describes the further features of SLF used
for representing the output of a recogniser. For reference, a full description of SLF is given in
Chapter 20.

A word network in SLF consists of a list of nodes and a list of arcs. The nodes represent words
and the arcs represent the transition between words¹. Each node and arc definition is written on a
single line and consists of a number of fields. Each field specification consists of a "name=value"
pair. Field names can be any length but all commonly used field names consist of a single letter. By
convention, field names starting with a capital letter are mandatory whereas field names starting
with a lower-case letter are optional. Any line beginning with a # is a comment and is ignored.

¹More precisely, nodes represent the ends of words and arcs represent the transitions between word ends. This
distinction becomes important when describing recognition output since acoustic scores are attached to arcs not
nodes.

Fig. 12.2 A Simple Word Network

The following example should illustrate the basic format of an SLF word network file. It
corresponds to the network illustrated in Fig 12.2 which represents all sequences consisting of the words
"bit" and "but" starting with the word "start" and ending with the word "end". As will be seen
later, the start and end words will be mapped to a silence model so this grammar allows speakers
to say "bit but but bit bit ....etc".

    # Define size of network: N=num nodes and L=num arcs
    N=4 L=8

    # List nodes: I=node-number, W=word
    I=0 W=start
    I=1 W=end
    I=2 W=bit
    I=3 W=but

    # List arcs: J=arc-number, S=start-node, E=end-node
    J=0 S=0 E=2
    J=1 S=0 E=3
    J=2 S=3 E=1
    J=3 S=2 E=1
    J=4 S=2 E=3
    J=5 S=3 E=3
    J=6 S=3 E=2
    J=7 S=2 E=2

Notice that the first line which defines the size of the network must be given before any node or
arc definitions. A node is a network start node if it has no predecessors, and a node is a network end
node if it has no successors. There must be one and only one network start node and one network
end node. In the above, node 0 is a network start node and node 1 is a network end node. The
choice of the names "start" and "end" for these nodes has no significance.
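
The start-node and end-node rule can be checked mechanically. The following is a minimal Python sketch, not part of HTK: the function names parse_slf and start_and_end_nodes are hypothetical, and the parser only handles the I, W, J, S and E fields used in the Bit-But example above.

```python
def parse_slf(text):
    """Parse the node and arc lines of a simple SLF word network (sketch only)."""
    nodes, arcs = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        fields = dict(field.split("=", 1) for field in line.split())
        if "I" in fields:                      # node definition
            nodes[int(fields["I"])] = fields.get("W")
        elif "J" in fields:                    # arc definition
            arcs.append((int(fields["S"]), int(fields["E"])))
        # the size line (N=... L=...) is skipped in this sketch
    return nodes, arcs


def start_and_end_nodes(nodes, arcs):
    """Start nodes have no predecessors; end nodes have no successors."""
    has_pred = {end for _, end in arcs}
    has_succ = {start for start, _ in arcs}
    starts = [n for n in sorted(nodes) if n not in has_pred]
    ends = [n for n in sorted(nodes) if n not in has_succ]
    return starts, ends


BIT_BUT = """
# Define size of network: N=num nodes and L=num arcs
N=4 L=8
I=0 W=start
I=1 W=end
I=2 W=bit
I=3 W=but
J=0 S=0 E=2
J=1 S=0 E=3
J=2 S=3 E=1
J=3 S=2 E=1
J=4 S=2 E=3
J=5 S=3 E=3
J=6 S=3 E=2
J=7 S=2 E=2
"""

nodes, arcs = parse_slf(BIT_BUT)
starts, ends = start_and_end_nodes(nodes, arcs)
print("start nodes:", starts, "end nodes:", ends)
```

A network is well formed in the sense described above when each of the two returned lists contains exactly one node.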

Fig. 12.3 A Word Network Using Null Nodes

A word network can have null nodes indicated by the special predefined word name !NULL. Null
nodes are useful for reducing the number of arcs required. For example, the Bit-But network could
be defined as follows



    # Network using null nodes
    N=6 L=7
    I=0 W=start
    I=1 W=end
    I=2 W=bit
    I=3 W=but
    I=4 W=!NULL
    I=5 W=!NULL
    J=0 S=0 E=4
    J=1 S=4 E=2
    J=2 S=4 E=3
    J=3 S=2 E=5
    J=4 S=3 E=5
    J=5 S=5 E=4
    J=6 S=5 E=1

In this case, there is no significant saving, however, if there were many words in parallel, the total
number of arcs would be much reduced by using null nodes to form common start and end points
for the loop-back connections.
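
The saving can be quantified with a short calculation. The sketch below is illustrative Python rather than anything provided by HTK; it assumes a loop of n words laid out exactly like the two Bit-But networks above (one start node, one end node, and either direct word-to-word arcs or two shared null nodes).

```python
def arcs_without_null(n_words):
    # start->word and word->end arcs, plus one arc for every ordered word pair
    return n_words + n_words + n_words * n_words


def arcs_with_null(n_words):
    # start->null, null->word, word->null, null->null loop-back, null->end
    return 1 + n_words + n_words + 1 + 1


for n in (2, 10, 1000):
    print(n, arcs_without_null(n), arcs_with_null(n))
```

For n = 2 this gives 8 and 7 arcs, matching the L=8 and L=7 headers of the two example networks; the direct form grows quadratically with the vocabulary whereas the null-node form grows linearly.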

By default, all arcs are equally likely. However, the optional field l=x can be used to attach the
log transition probability x to an arc. For example, if the word "but" was twice as likely as "bit",
the arcs numbered 1 and 2 in the last example could be changed to

    J=1 S=4 E=2 l=-1.1
    J=2 S=4 E=3 l=-0.4

Here the probabilities have been normalised to sum to 1, however, this is not necessary. The
recogniser simply adds the scaled log probability to the path score and hence it can be regarded as
an additive word transition penalty.
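
The two values in this example can be reproduced with a few lines of Python. This is only a sketch: the helper name arc_log_probs is hypothetical, and the use of natural logarithms is an assumption inferred from the fact that ln(1/3) and ln(2/3) round to the -1.1 and -0.4 shown above.

```python
import math


def arc_log_probs(relative_weights):
    """Normalise relative likelihoods and return natural-log probabilities."""
    total = sum(relative_weights.values())
    return {word: math.log(w / total) for word, w in relative_weights.items()}


# "but" is assumed to be twice as likely as "bit"
log_probs = arc_log_probs({"bit": 1.0, "but": 2.0})
for word, value in log_probs.items():
    print(f"{word}: l={value:.1f}")
```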

12.3 Building a Word Network with HParse

Whilst the construction of a word level SLF network file by hand is not difficult, it can be somewhat
tedious. In earlier versions of HTK, a high level grammar notation based on extended Backus-Naur
Form (EBNF) was used to specify recognition grammars. This HParse format was read-in directly
by the recogniser and compiled into a finite state recognition network at run-time.

In HTK 3.4, HParse format is still supported but in the form of an off-line compilation into an
SLF word network which can subsequently be used to drive a recogniser.

A HParse format grammar consists of an extended form of regular expression enclosed within
parentheses. Expressions are constructed from sequences of words and the metacharacters

    |       denotes alternatives
    [ ]     encloses options
    { }     denotes zero or more repetitions
    < >     denotes one or more repetitions
    << >>   denotes context-sensitive loop

The following examples will illustrate the use of all of these except the last which is a special-purpose
facility provided for constructing context-sensitive loops as found in for example, context-dependent
phone loops and word-pair grammars. It is described in the reference entry for HParse.

As a first example, suppose that a simple isolated word single digit recogniser was required. A
suitable syntax would be

    (
       one | two | three | four | five |
       six | seven | eight | nine | zero
    )

This would translate into the network shown in part (a) of Fig. 12.4. If this HParse format syntax
definition was stored in a file called digitsyn, the equivalent SLF word network would be generated
in the file digitnet by typing

    HParse digitsyn digitnet

The above digit syntax assumes that each input digit is properly end-pointed. This requirement
can be removed by adding a silence model before and after the digit

    (
       sil (one | two | three | four | five |
            six | seven | eight | nine | zero) sil
    )

As shown by graph (b) of Fig. 12.4, the allowable sequence of models now consists of silence followed
by a digit followed by silence. If a sequence of digits needed to be recognised then angle brackets
can be used to indicate one or more repetitions; the HParse grammar

    (
       sil < one | two | three | four | five |
             six | seven | eight | nine | zero > sil
    )

would accomplish this. Part (c) of Fig. 12.4 shows the network that would result in this case.

Fig. 12.4 Example Digit Recognition Networks

HParse grammars can define variables to represent sub-expressions. Variable names start with
a dollar symbol and they are given values by definitions of the form

    $var = expression ;

For example, the above connected digit grammar could be rewritten as

    $digit = one | two | three | four | five |
             six | seven | eight | nine | zero;
    (
       sil < $digit > sil
    )

Here $digit is a variable whose value is the expression appearing on the right hand side of the 
assignment. Whenever the name of a variable appears within an expression, the corresponding 
expression is substituted. Note however that variables must be defined before use, hence, recursion 
is prohibited. 

As a final refinement of the digit grammar, the start and end silence can be made optional by 
enclosing them within square brackets thus 

    $digit = one | two | three | four | five |
             six | seven | eight | nine | zero;
    (
       [sil] < $digit > [sil]
    )

Part (d) of Fig. 12.4 shows the network that would result in this last case. 
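
Since an HParse grammar is an extended regular expression over words, the language it defines can be mimicked with an ordinary regular expression. The Python sketch below is purely illustrative and does not use HParse: it hand-translates the final digit grammar, with [ ] mapped to an optional group and < > to one-or-more repetition, and the function name accepts is hypothetical.

```python
import re

DIGITS = "one|two|three|four|five|six|seven|eight|nine|zero"

# [sil] < $digit > [sil], with every word followed by a single space
PATTERN = re.compile(rf"(?:sil )?(?:(?:{DIGITS}) )+(?:sil )?")


def accepts(words):
    """Return True if the word sequence is in the language of the grammar."""
    return PATTERN.fullmatch("".join(w + " " for w in words)) is not None


print(accepts(["sil", "one", "two", "sil"]))  # digits with both silences
print(accepts(["three", "zero"]))             # silences are optional
print(accepts(["sil", "sil"]))                # at least one digit is required
```

A check of this kind is only practical for small grammars; for real networks the HSGen tool described later in this chapter serves the same purpose.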

HParse format grammars are a convenient way of specifying task grammars for interactive voice
interfaces. As a final example, the following defines a simple grammar for the control of a telephone
by voice.

    $digit  = one | two | three | four | five |
              six | seven | eight | nine | zero;
    $number = $digit { [pause] $digit};
    $scode  = shortcode $digit $digit;
    $telnum = $scode | $number;
    $cmd    = dial $telnum |
              enter $scode for $number |
              redial | cancel;
    $noise  = lipsmack | breath | background;
    ( < $cmd | $noise > )

The dictionary entries for pause, lipsmack, breath and background would reference HMMs trained
to model these types of noise and the corresponding output symbols in the dictionary would be
null.

Finally, it should be noted that when the HParse format was used in earlier versions of HTK,
word grammars contained word pronunciations embedded within them. This was done by using
the reserved node names WD_BEGIN and WD_END to delimit word boundaries. To provide backwards
compatibility, HParse can process these old format networks but when doing so it outputs a
dictionary as well as a word network. This compatibility mode is defined fully in the reference section;
to use it the configuration variable V1COMPAT must be set true or the -c option set.

Finally on the topic of word networks, it is important to note that any network containing an
unbroken loop of one or more tee-models will generate an error. For example, if sp is a single state
tee-model used to represent short pauses, then the following network would generate an error

    ( sil < sp | $digit > sil )

The intention here is to recognise a sequence of digits which may optionally be separated by short
pauses. However, the syntax allows an endless sequence of sp models and hence, the recogniser
could traverse this sequence without ever consuming any input. The solution to problems such as
these is to rearrange the network. For example, the above could be written as

    ( sil < $digit sp > sil )
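
The condition can be pictured as a cycle search restricted to tee-model nodes. The following Python sketch is a simplified word-level illustration and is not how HTK performs the check internally; the node names, the arc lists and the function has_tee_loop are all hypothetical.

```python
def has_tee_loop(arcs, tee_nodes):
    """Return True if some cycle passes only through tee-model nodes."""
    # keep only arcs whose two ends are both tee-model nodes
    succ = {}
    for start, end in arcs:
        if start in tee_nodes and end in tee_nodes:
            succ.setdefault(start, []).append(end)

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    colour = {node: WHITE for node in tee_nodes}

    def visit(node):
        colour[node] = GREY
        for nxt in succ.get(node, []):
            if colour[nxt] == GREY:
                return True               # back edge: a loop of tee-models
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[node] == WHITE and visit(node) for node in tee_nodes)


# ( sil < sp | $digit > sil ): sp can follow sp directly
bad_arcs = [("sil0", "sp"), ("sil0", "digit"), ("sp", "sp"), ("sp", "digit"),
            ("digit", "sp"), ("digit", "digit"), ("sp", "sil1"), ("digit", "sil1")]
# ( sil < $digit sp > sil ): every loop passes through a digit
good_arcs = [("sil0", "digit"), ("digit", "sp"), ("sp", "digit"), ("sp", "sil1")]

print(has_tee_loop(bad_arcs, {"sp"}), has_tee_loop(good_arcs, {"sp"}))
```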

12.4 Bigram Language Models 

Before continuing with the description of network generation and, in particular, the use of HBuild,
the use of bigram language models needs to be described. Support for statistical language models in
HTK is provided by the library module HLM. Although the interface to HLM can support general
N-grams, the facilities for constructing and using N-grams are limited to bigrams.

A bigram language model can be built using HLStats invoked as follows where it is assumed
that all of the label files used for training are stored in an MLF called labs

HLStats -b bigfn -o wordlist labs 

All words used in the label files must be listed in the wordlist. This command will read all of 
the transcriptions in labs, build a table of bigram counts in memory, and then output a back-off 
bigram to the file bigfn. The formulae used for this are given in the reference entry for HLStats. 
However, the basic idea is encapsulated in the following formula 



    p(i,j) = \begin{cases}
               (N(i,j) - D)/N(i) & \text{if } N(i,j) > t \\
               b(i)\,p(j)        & \text{otherwise}
             \end{cases}

where N(i,j) is the number of times word j follows word i and N(i) is the number of times that word
i appears. Essentially, a small part of the available probability mass is deducted from the higher
bigram counts and distributed amongst the infrequent bigrams. This process is called discounting.
The default value for the discount constant D is 0.5 but this can be altered using the configuration
variable DISCOUNT. When a bigram count falls below the threshold t, the bigram is backed-off to
the unigram probability suitably scaled by a back-off weight in order to ensure that all bigram
probabilities for a given history sum to one.
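
The formula can be illustrated with a small Python sketch. It is not the HLStats implementation: the function name backoff_bigram is hypothetical, the unigram p(j) is assumed to be a plain relative frequency, N(i) is assumed to count occurrences of word i as a history, and the back-off weight b(i) is simply chosen so that the probabilities for each history sum to one. The training sentences are invented for the example.

```python
from collections import Counter


def backoff_bigram(sentences, D=0.5, t=0):
    """Return a function prob(i, j) giving a discounted back-off bigram estimate."""
    unigram = Counter(w for s in sentences for w in s)
    total = sum(unigram.values())
    p_uni = {w: c / total for w, c in unigram.items()}        # assumed p(j)
    bigram = Counter((a, b) for s in sentences for a, b in zip(s, s[1:]))
    history = Counter(a for s in sentences for a in s[:-1])   # assumed N(i)

    def prob(i, j):
        if bigram[(i, j)] > t:
            return (bigram[(i, j)] - D) / history[i]          # discounted bigram
        # words whose bigram with history i is kept explicitly
        kept = [w for w in p_uni if bigram[(i, w)] > t]
        kept_mass = sum((bigram[(i, w)] - D) / history[i] for w in kept)
        # b(i): share the remaining mass over the backed-off words
        b_i = (1.0 - kept_mass) / (1.0 - sum(p_uni[w] for w in kept))
        return b_i * p_uni[j]

    return prob


sentences = [["!ENTER", "bit", "but", "!EXIT"],
             ["!ENTER", "but", "but", "bit", "!EXIT"]]
prob = backoff_bigram(sentences)
vocab = sorted({w for s in sentences for w in s})
for i in vocab:
    print(i, round(sum(prob(i, j) for j in vocab), 6))
```

Each printed row sum should be one, which is the property the back-off weight is there to guarantee.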

Backed-off bigrams are stored in a text file using the standard ARPA MIT-LL format which as 
used in HTK is as follows 



    \data\
    ngram 1=<num 1-grams>
    ngram 2=<num 2-grams>

    \1-grams:
    P(!ENTER)        !ENTER   B(!ENTER)
    P(W1)            W1       B(W1)
    P(W2)            W2       B(W2)
    ...
    P(!EXIT)         !EXIT    B(!EXIT)

    \2-grams:
    P(W1 | !ENTER)   !ENTER   W1
    P(W2 | !ENTER)   !ENTER   W2
    P(W1 | W1)       W1       W1
    P(W2 | W1)       W1       W2
    P(W1 | W2)       W2       W1
    ....
    P(!EXIT | W1)    W1       !EXIT
    P(!EXIT | W2)    W2       !EXIT
    \end\

where all probabilities are stored as base-10 logs. The default start and end words, !ENTER and
!EXIT can be changed using the HLStats -s option.
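
To make the role of the two sections concrete, the Python sketch below reads a tiny file in this layout and looks up a bigram, falling back to B(i) + P(j) in the log domain when the pair is not listed. It is a minimal illustration rather than HLM code: the function names are hypothetical and the numbers in the sample file are invented.

```python
def parse_arpa_bigram(text):
    """Read the 1-gram and 2-gram sections of a small ARPA-style file."""
    unigrams, bigrams, section = {}, {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("\\"):
            section = line                 # e.g. \data\, \1-grams:, \end\
            continue
        parts = line.split()
        if section == "\\1-grams:":
            # log10 P(w), w, optional log10 back-off weight B(w)
            backoff = float(parts[2]) if len(parts) > 2 else 0.0
            unigrams[parts[1]] = (float(parts[0]), backoff)
        elif section == "\\2-grams:":
            bigrams[(parts[1], parts[2])] = float(parts[0])
    return unigrams, bigrams


def log10_prob(unigrams, bigrams, prev, word):
    """log10 P(word | prev), backing off to B(prev) + P(word) if needed."""
    if (prev, word) in bigrams:
        return bigrams[(prev, word)]
    log_p, _ = unigrams[word]
    _, log_b = unigrams[prev]
    return log_b + log_p


SAMPLE = r"""
\data\
ngram 1=4
ngram 2=2

\1-grams:
-0.6021 !ENTER -0.3010
-0.6021 bit -0.3010
-0.6021 but -0.3010
-0.6021 !EXIT

\2-grams:
-0.3010 !ENTER bit
-0.1761 bit but
\end\
"""

unigrams, bigrams = parse_arpa_bigram(SAMPLE)
print(log10_prob(unigrams, bigrams, "bit", "but"))   # listed bigram
print(log10_prob(unigrams, bigrams, "but", "bit"))   # backed-off estimate
```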

For some applications, a simple matrix style of bigram representation may be more appropriate.
If the -o option is omitted in the above invocation of HLStats, then a simple full bigram matrix
will be output using the format



    !ENTER   0   P(W1 | !ENTER)   P(W2 | !ENTER)   .....
    W1       0   P(W1 | W1)       P(W2 | W1)       .....
    W2       0   P(W1 | W2)       P(W2 | W2)       .....
    ...
    !EXIT    0   PN               PN               .....

where the probability P(wj | wi) is given by row i, j of the matrix. If there are a total of N words in
the vocabulary then PN in the above is set to 1/(N + 1); this ensures that the last row sums to one.
As a very crude form of smoothing, a floor can be set using the -f minp option to prevent any entry
falling below minp. Note, however, that this does not affect the bigram entries in the first column
which are zero by definition. Finally, as with the storage of tied-mixture and discrete probabilities,
a run-length encoding scheme is used whereby any value can be followed by an asterisk and a repeat
count (see section 7.5).
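
The run-length convention is easy to expand when reading such a matrix outside HTK. The Python sketch below assumes that a repeated value is written as a single token of the form value*count with no embedded spaces; that token layout and the function name expand_run_lengths are assumptions made for illustration.

```python
def expand_run_lengths(tokens):
    """Expand tokens such as '0.25*3' into repeated floating point values."""
    values = []
    for token in tokens:
        if "*" in token:
            value, count = token.split("*")
            values.extend([float(value)] * int(count))
        else:
            values.append(float(token))
    return values


row = expand_run_lengths("0 0.25*3 0.1".split())
print(row)
```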

12.5 Building a Word Network with HBuild

Fig. 12.5 Decimal Syntax

As mentioned in the introduction, the main function of HBuild is to allow a word-level network to
be constructed from a main lattice and a set of sub-lattices. Any lattice can contain node definitions
which refer to other lattices. This allows a word-level recognition network to be decomposed into a
number of sub-networks which can be reused at different points in the network.

For example, suppose that decimal number input was required. A suitable network structure
would be as shown in Fig. 12.5. However, to write this directly in an SLF file would require the
digit loop to be written twice. This can be avoided by defining the digit loop as a sub-network and
referencing it within the main decimal network as follows

    # Digit network
    SUBLAT=digits
    N=14 L=21
    # define digits
    I=0 W=zero
    I=1 W=one
    I=2 W=two
    ...
    I=9 W=nine
    # enter/exit & loop-back null nodes
    I=10 W=!NULL
    I=11 W=!NULL
    I=12 W=!NULL
    I=13 W=!NULL
    # null->null->digits
    J=0 S=10 E=11
    J=1 S=11 E=0
    J=2 S=11 E=1
    ...
    J=10 S=11 E=9
    # digits->null->null
    J=11 S=0 E=12
    ...
    J=19 S=9 E=12
    J=20 S=12 E=13
    # finally add loop back
    J=21 S=12 E=11
    .

    # Decimal network
    N=5 L=4
    # digits -> point -> digits
    I=0 W=start
    I=1 L=digits
    I=2 W=pause
    I=3 L=digits
    I=4 W=end
    # digits -> point -> digits
    J=0 S=0 E=1
    J=1 S=1 E=2
    J=2 S=2 E=3
    J=3 S=3 E=4

The sub-network is identified by the field SUBLAT in the header and it is terminated by a single
period on a line by itself. The main body of the sub-network is written as normal. Once defined, a
sub-network can be substituted into a higher level network using an L field in a node definition, as
in nodes 1 and 3 of the decimal network above.

Of course, this process can be continued and a higher level network could reference the decimal
network wherever it needed decimal number entry.

Fig. 12.6 Back-off Bigram Loop Network (figure labels: full bigram, back-off weight)

One of the commonest forms of recognition network is the word-loop where all vocabulary items
are placed in parallel with a loop-back to allow any word sequence to be recognised. This is the
basic arrangement used in most dictation or transcription applications. HBuild can build such a
loop automatically from a list of words. It can also read in a bigram in either ARPA MIT-LL format
or HTK matrix format and attach a bigram probability to each word transition. Note, however,
that using a full bigram language model means that every distinct pair of words must have its own
unique loop-back transition. This increases the size of the network considerably and slows down the
recogniser. When a back-off bigram is used, however, backed-off transitions can share a common
loop-back transition. Fig. 12.6 illustrates this. When backed-off bigrams are input via an ARPA
MIT-LL format file, HBuild will exploit this where possible.

Finally, HBuild can automatically construct a word-pair grammar as used in the ARPA Naval
Resource Management task.

12.6 Testing a Word Network using HSGen

When designing task grammars, it is useful to be able to check that the language defined by the
final word network is as envisaged. One simple way to check this is to use the network as a generator
by randomly traversing it and outputting the name of each word node encountered. HTK provides
a very simple tool called HSGen for doing this.

As an example if the file bnet contained the simple Bit-But network described above and the
file bdic contained a corresponding dictionary then the command



HSGen bnet bdic 



would generate a random list of examples of the language defined by bnet, for example, 

    start bit but bit bit bit end
    start but bit but but end
    start bit bit but but end
    .... etc

This is perhaps not too informative in this case but for more complex grammars, this type of output 
can be quite illuminating. 

HSGen will also estimate the empirical entropy by recording the probability of each sentence
generated. To use this facility, it is best to suppress the sentence output and generate a large
number of examples. For example, executing

    HSGen -s -n 1000 -q bnet bdic

where the -s option requests statistics, the -q option suppresses the output and -n 1000 asks for
1000 sentences would generate the following output

    Number of Nodes = 4 [0 null], Vocab Size = 4
    Entropy = 1.156462, Perplexity = 2.229102
    1000 Sentences: average len = 5.1, min=3, max=19
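
The idea behind these statistics can be sketched in a few lines of Python. This is not HSGen itself: the per-word entropy formula used here (total negative log2 sentence probability divided by the total number of words, with the start and end words counted) and the relation perplexity = 2^entropy are assumptions, although they give values close to those printed above for the Bit-But network. The names ARCS, random_sentence and estimate_entropy are hypothetical.

```python
import math
import random

# Bit-But network: each word maps to its equally likely successors
ARCS = {
    "start": ["bit", "but"],
    "bit": ["end", "but", "bit"],
    "but": ["end", "but", "bit"],
    "end": [],
}


def random_sentence(rng):
    """Randomly traverse the network; return the words and log2 probability."""
    words, log2_p, word = ["start"], 0.0, "start"
    while ARCS[word]:
        successors = ARCS[word]
        word = rng.choice(successors)
        log2_p += math.log2(1.0 / len(successors))
        words.append(word)
    return words, log2_p


def estimate_entropy(n_sentences, seed=0):
    """Estimate per-word entropy (bits) and perplexity by random sampling."""
    rng = random.Random(seed)
    total_bits, total_words = 0.0, 0
    for _ in range(n_sentences):
        words, log2_p = random_sentence(rng)
        total_bits -= log2_p
        total_words += len(words)
    entropy = total_bits / total_words
    return entropy, 2.0 ** entropy


entropy, perplexity = estimate_entropy(5000)
print(f"Entropy = {entropy:.3f}, Perplexity = {perplexity:.3f}")
```

Because the estimate is based on random sampling, the exact figures vary with the seed and the number of sentences.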

12.7 Constructing a dictionary 

As explained in section 12.1, the word level network is expanded by HNet to create the network of
HMM instances needed by the recogniser. The way in which each word is expanded is determined
from a dictionary.

A dictionary for use in HTK has a very simple format. Each line consists of a single word
pronunciation with format

    WORD [ '['OUTSYM']' ] [PRONPROB] P1 P2 P3 P4 ....

where WORD represents the word, followed by the optional parameters OUTSYM and PRONPROB, where
OUTSYM is the symbol to output when that word is recognised (which must be enclosed in square
brackets, [ and ]) and PRONPROB is the pronunciation probability (0.0 - 1.0). P1, P2, ... is the
sequence of phones or HMMs to be used in recognising that word. The output symbol and the
pronunciation probability are optional. If an output symbol is not specified, the name of the word
itself is output. If a pronunciation probability is not specified then a default of 1.0 is assumed.
Empty square brackets, [], can be used to suppress any output when that word is recognised. For
example, a dictionary might contain

    bit             b ih t
    but             b ah t
    dog    [woof]   d ao g
    cat    [meow]   k ae t
    start  []       sil
    end    []       sil



If any word has more than one pronunciation, then the word has a repeated entry, for example,

    the    th iy
    the    th ax

corresponding to the stressed and unstressed forms of the word "the".
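
The line format can be illustrated with a small Python parser. It is a simplified sketch rather than the HDict implementation: quoting and escaping of HTK strings are ignored, a leading numeric token is always treated as the pronunciation probability, and the function name parse_dict_line is hypothetical.

```python
def parse_dict_line(line):
    """Split a dictionary line into (word, output symbol, probability, phones)."""
    tokens = line.split()
    word, rest = tokens[0], tokens[1:]
    outsym, prob = word, 1.0            # defaults when the fields are omitted
    if rest and rest[0].startswith("[") and rest[0].endswith("]"):
        outsym = rest[0][1:-1]          # "" means that nothing is output
        rest = rest[1:]
    if rest:
        try:
            prob = float(rest[0])       # optional pronunciation probability
            rest = rest[1:]
        except ValueError:
            pass                        # first remaining token is a phone
    return word, outsym, prob, rest


for entry in ("bit b ih t", "dog [woof] d ao g", "start [] sil", "the 0.7 th ax"):
    print(parse_dict_line(entry))
```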

The pronunciations in a dictionary are normally at the phone level as in the above examples.
However, if context-dependent models are wanted, these can be included directly in the dictionary.
For example, the Bit-But entries might be written as

    bit    b+ih b-ih+t ih-t
    but    b+ah b-ah+t ah-t

In principle, this is never necessary since HNet can perform context expansion automatically,
however, it saves computation to do this off-line as part of the dictionary construction process. Of
course, this is only possible for word-internal context dependencies. Cross-word dependencies can
only be generated by HNet.
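
The word-internal naming convention used in these entries (left context followed by "-", right context preceded by "+") can be reproduced as follows. This is an illustrative Python sketch, not the code used by HNet, and the function name word_internal_triphones is hypothetical.

```python
def word_internal_triphones(phones):
    """Convert a monophone pronunciation into word-internal context names."""
    expanded = []
    for k, phone in enumerate(phones):
        left = phones[k - 1] + "-" if k > 0 else ""                  # no left context at word start
        right = "+" + phones[k + 1] if k < len(phones) - 1 else ""  # no right context at word end
        expanded.append(left + phone + right)
    return expanded


print(word_internal_triphones(["b", "ih", "t"]))
print(word_internal_triphones(["b", "ah", "t"]))
```

The first and last phones of a word become biphones because no context from neighbouring words is available, which is why cross-word dependencies cannot be written into the dictionary in this way.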

Fig. 12.7 Dictionary Construction using HDMan (inputs: source dictionaries Src1, Src2 and Src3 with
edit scripts Src1.ded, Src2.ded and Src3.ded, and a word list)

Pronouncing dictionaries are a valuable resource and if produced manually, they can require
considerable investment. There are a number of commercial and public domain dictionaries
available, however, these will typically have differing formats and will use different phone sets. To assist
in the process of dictionary construction, HTK provides a tool called HDMan which can be used
to edit and merge differing source dictionaries to form a single uniform dictionary. The way that
HDMan works is illustrated in Fig. 12.7.

Each source dictionary file must have one pronunciation per line and the words must be sorted
into alphabetical order. The word entries must be valid HTK strings as defined in section 4.6. If an
arbitrary character sequence is to be allowed, then the input edit script should have the command
IM RAW as its first command.

The basic operation of HDMan is to scan the input streams and for each new word encountered,
copy the entry to the output. In the figure, a word list is also shown. This is optional but if included
HDMan only copies words in the list. Normally, HDMan copies just the first pronunciation that it
finds for any word. Thus, the source dictionaries are usually arranged in order of reliability, possibly
preceded by a small dictionary of special word pronunciations. For example, in Fig. 12.7, the main
dictionary might be Src2. Src1 might be a small dictionary containing correct pronunciations for
words in Src2 known to have errors in them. Finally, Src3 might be a large poor quality dictionary
(for example, it could be generated by a rule-based text-to-phone system) which is included as a
last resort source of pronunciations for words not in the main dictionary.
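
The priority rule described above can be summarised in a short Python sketch. It only models the default behaviour of taking the first pronunciation found, it is not the HDMan implementation, and the function name merge_dictionaries and the example entries are hypothetical.

```python
def merge_dictionaries(sources, word_list=None):
    """Merge sources given in priority order; the first pronunciation found wins."""
    merged = {}
    for source in sources:                      # most reliable source first
        for word, phones in source:
            if word_list is not None and word not in word_list:
                continue                        # only copy words in the list
            if word not in merged:
                merged[word] = phones
    return dict(sorted(merged.items()))         # output in alphabetical order


src1 = [("bottle", ["b", "oh", "t", "el"])]                       # hand corrections
src2 = [("bit", ["b", "ih", "t"]), ("bottle", ["b", "oh", "t"])]  # main dictionary
src3 = [("bit", ["b", "iy", "t"]), ("but", ["b", "ah", "t"])]     # last resort

merged = merge_dictionaries([src1, src2, src3], word_list={"bit", "bottle", "but"})
for word, phones in merged.items():
    print(word, " ".join(phones))
```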

As shown in the figure, HDMan can apply a set of editing commands to each source dictionary
and it can also edit the output stream. The commands available are described in full in the reference
section. They operate in a similar way to those in HLEd. Each set of commands is written in an edit
script with one command per line. Each input edit script has the same name as the corresponding
source dictionary but with the extension .ded added. The output edit script is stored in a file
called global.ded. The commands provided include replace and delete at the word and phone
level, context-sensitive replace and automatic conversions to left biphones, right biphones and word
internal triphones.

When HDMan loads a dictionary it adds word boundary symbols to the start and end of each 
pronunciation and then deletes them when writing out the new dictionary. The default for these 
word boundary symbols is # but it can be redefined using the -b option. The reason for this is 
to allow context-dependent edit commands to take account of word-initial and word-final phone 
positions. The examples below will illustrate this. 

Rather than go through each HDMan edit command in detail, some examples will illustrate
the typical manipulations that can be performed by HDMan. Firstly, suppose that a dictionary
transcribed unstressed "-ed" endings as ih0 d but the required dictionary does not mark stress but
uses a schwa in such cases, that is, the transformations

    ih0 d #  ->  ax d
    ih0      ->  ih (otherwise)

are required. These could be achieved by the following 3 commands

    MP axd0 ih0 d #
    SP axd0 ax d #
    RP ih ih0

The context sensitive replace is achieved by merging all sequences of ih0 d # and then splitting
the result into the sequence ax d #. The final RP command then unconditionally replaces all
occurrences of ih0 by ih. As a second similar example, suppose that all examples of ax l (as
in "bottle") are to be replaced by the single phone el provided that the immediately following
phone is a non-vowel. This requires the use of the DC command to define a context consisting of all
non-vowels, then a merge using MP as above followed by a context-sensitive replace



    DC nonv l r w y .... m n
    MP axl ax l
    CR el * axl nonv
    SP axl ax l

The final step converts all non-transformed cases of ax l back to their original form.
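
The merge, split and replace idiom of the first example can be emulated on a single pronunciation in Python. The sketch below is only an illustration of the effect of the three commands, not of how HDMan implements them; the helper names and the example pronunciations are hypothetical.

```python
def replace_seq(phones, old, new):
    """Replace every occurrence of the sub-sequence old by the sequence new."""
    out, k = [], 0
    while k < len(phones):
        if phones[k:k + len(old)] == old:
            out.extend(new)
            k += len(old)
        else:
            out.append(phones[k])
            k += 1
    return out


def fix_ed_endings(pron):
    """Emulate MP axd0 ih0 d # / SP axd0 ax d # / RP ih ih0 on one entry."""
    phones = ["#"] + pron + ["#"]                               # add word boundary symbols
    phones = replace_seq(phones, ["ih0", "d", "#"], ["axd0"])   # MP axd0 ih0 d #
    phones = replace_seq(phones, ["axd0"], ["ax", "d", "#"])    # SP axd0 ax d #
    phones = replace_seq(phones, ["ih0"], ["ih"])               # RP ih ih0
    return [p for p in phones if p != "#"]                      # remove boundary symbols


print(fix_ed_endings(["w", "ey", "t", "ih0", "d"]))   # word-final "-ed"
print(fix_ed_endings(["ih0", "d", "l", "iy"]))        # ih0 d not at the word end
```

The word boundary symbol is what lets the merge distinguish a word-final ih0 d from the same sequence elsewhere in a word.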

As a final example, a typical output transformation applied via the edit script global.ded will
convert all phones to context-dependent form and append a short pause model sp at the end of
each pronunciation. The following two commands will do this

TC 

AS sp 

For example, these commands would convert the dictionary entry

    BAT b ah t

into

    BAT b+ah b-ah+t ah-t sp

Finally, if the -l option is set, HDMan will generate a log file containing a summary of the
pronunciations used from each source and how many words, if any are missing. It is also possible to
give HDMan a phone list using the -n option. In this case, HDMan will record how many times
each phone was used and also, any phones that appeared in pronunciations but are not in the phone
list. This is useful for detecting errors and unexpected phone symbols in the source dictionary.

12.8 Word Network Expansion 

Now that word networks and dictionaries have been explained, the conversion of word level networks
to model-based recognition networks will be described. Referring again to Fig 12.1, this expansion
is performed automatically by the module HNet. By default, HNet attempts to infer the required
expansion from the contents of the dictionary and the associated list of HMMs. However, 5
configuration parameters are supplied to apply more precise control where required: ALLOWCXTEXP,
ALLOWXWRDEXP, FORCECXTEXP, FORCELEFTBI and FORCERIGHTBI.
The expansion proceeds in four stages.

1. Context definition 

The first step is to determine how model names are constructed from the dictionary entries 
and whether cross-word context expansion should be performed. The dictionary is scanned 
and each distinct phone is classified as either 
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(a) Context Free 

In this case, the phone is skipped when determining context. An example is a model (sp) 
for short pauses. This will typically be inserted at the end of every word pronunciation 
but since it tends to cover a very short segment of speech it should not block context- 
dependent effects in a cross-word triphone system. 

(b) Context Independent 

The phone only exists in context-independent form. A typical example would be a silence 
model (sil). Note that the distinction that would be made by HNet between sil and 
sp is that whilst both would only appear in the HMM set in context-independent form, 
sil would appear in the contexts of other phones whereas sp would not.

(c) Context Dependent

This classification depends on whether a phone appears in the context part of the name
and whether any context dependent versions of the phone exist in the HMMSet. Context
Dependent phones will be subject to model name expansion.

2. Determination of network type

The default behaviour is to produce the simplest network possible. If the dictionary is closed
(every phone name appears in the HMM list), then no expansion of phone names is performed.
The resulting network is generated by straightforward substitution of each dictionary
pronunciation for each word in the word network. If the dictionary is not closed, then if word
internal context expansion would find each model in the HMM set then word internal context
expansion is used. Otherwise, full cross-word context expansion is applied.

The determination of the network type can be modified by using the configuration parameters
mentioned earlier. By default ALLOWCXTEXP is set true. If ALLOWCXTEXP is set false, then no
expansion of phone names is performed and each phone corresponds to the model of the
same name. The default value of ALLOWXWRDEXP is false, thus preventing context expansion
across word boundaries. This also limits the expansion of the phone labels in the dictionary
to word internal contexts only. If FORCECXTEXP is set true, then context expansion will be
performed. For example, if the HMM set contained all monophones, all biphones and all
triphones, then given a monophone dictionary, the default behaviour of HNet would be to
generate a monophone recognition network since the dictionary would be closed. However, if
FORCECXTEXP is set true and ALLOWXWRDEXP is set false then word internal context expansion
will be performed. If FORCECXTEXP is set true and ALLOWXWRDEXP is set true then full
cross-word context expansion will be performed.

3. Network expansion

Each word in the word network is transformed into a word-end node preceded by the sequence
of model nodes corresponding to the word's pronunciation. For cross word context expansion,
the initial and final context dependent phones (and any preceding/following context
independent ones) are duplicated as many times as is necessary to cater for each different
cross word context. Each duplicated word-final phone is followed by a similarly duplicated
word-end node. Null words are simply transformed into word-end nodes with no preceding
model nodes.

4. Linking of models to network nodes

Each model node is linked to the corresponding HMM definition. In each case, the required
HMM model name is determined from the phone name and the surrounding context names.
The algorithm used for this is

(a) Construct the context-dependent name and see if the corresponding model exists.

(b) Construct the context-independent name and see if the corresponding model exists.

If the configuration variable ALLOWCXTEXP is false (a) is skipped and if the configuration
variable FORCECXTEXP is true (b) is skipped. If no matching model is found, an error is
generated. When the right context is a boundary or FORCELEFTBI is true, then the
context-dependent name takes the form of a left biphone, that is, the phone p with left context
l becomes l-p. When the left context is a boundary or FORCERIGHTBI is true, then the
context-dependent name takes the form of a right biphone, that is, the phone p with right
context r becomes p+r. Otherwise, the context-dependent name is a full triphone, that is,
l-p+r. Context-free phones are skipped in this process so

sil aa r sp y uw sp sil

would be expanded as 

sil sil-aa+r aa-r+y sp r-y+uw y-uw+sil sp sil 

assuming that sil is context-independent and sp is context-free. For word-internal systems, 
the context expansion can be further controlled via the configuration variable CFWORDBOUNDARY. 
When set true (default setting) context-free phones will be treated as word boundaries so 

aa r sp y uw sp

would be expanded as

aa+r aa-r sp y+uw y-uw sp

Setting CFWORDBOUNDARY false would produce

aa+r aa-r+y sp r-y+uw y-uw sp

Note that in practice, stages (3) and (4) above actually proceed concurrently so that for the first and
last phone of context-dependent models, logical models which have the same underlying physical
model can be merged.
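
The naming rules used in the expansion examples above can be sketched as follows. This is an illustrative Python reconstruction of the rules as stated in the text, not HNet's code: the function `expand_contexts` and its parameters are hypothetical, the sets of context-free and context-independent phones are passed in rather than inferred from a dictionary and HMM list, and FORCELEFTBI and FORCERIGHTBI are not modelled.

```python
def expand_contexts(phones, context_free=("sp",), context_indep=("sil",),
                    cf_is_boundary=False):
    """Return context-dependent model names for a phone sequence.

    context_free phones are skipped when looking for a neighbour, unless
    cf_is_boundary is set (mirroring CFWORDBOUNDARY), in which case they
    act as a boundary. context_indep phones keep their own name but may
    still serve as context for their neighbours.
    """
    def neighbour(start, step):
        k = start + step
        while 0 <= k < len(phones):
            if phones[k] in context_free:
                if cf_is_boundary:
                    return None      # context-free phone acts as a boundary
                k += step            # otherwise look past it
                continue
            return phones[k]
        return None                  # ran off the sequence: boundary

    names = []
    for k, p in enumerate(phones):
        if p in context_free or p in context_indep:
            names.append(p)
            continue
        left, right = neighbour(k, -1), neighbour(k, +1)
        name = p
        if left is not None:
            name = left + "-" + name     # left biphone or triphone
        if right is not None:
            name = name + "+" + right    # right biphone or triphone
        names.append(name)
    return names


print(" ".join(expand_contexts("sil aa r sp y uw sp sil".split())))
print(" ".join(expand_contexts("aa r sp y uw sp".split(), cf_is_boundary=True)))
print(" ".join(expand_contexts("aa r sp y uw sp".split(), cf_is_boundary=False)))
```

The three calls correspond to the cross-word example and to the two CFWORDBOUNDARY settings respectively.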

Fig. 12.8 Monophone Expansion of Bit-But Network

Having described the expansion process in some detail, some simple examples will help clarify
the process. All of these are based on the Bit-But word network illustrated in Fig. 12.2. Firstly,
assume that the dictionary contains simple monophone pronunciations, that is



bit    b i t
but    b u t
start  sil
end    sil

and the HMM set consists of just monophones

b i t u sil

In this case, HNet will find a closed dictionary. There will be no expansion and it will directly
generate the network shown in Fig. 12.8. In this figure, the rounded boxes represent model nodes
and the square boxes represent word-end nodes.

Similarly, if the dictionary contained word-internal triphone pronunciations such as

bit    b+i b-i+t i-t
but    b+u b-u+t u-t
start  sil
end    sil

and the HMM set contains all the required models

b+i b-i+t i-t b+u b-u+t u-t sil

then again HNet will find a closed dictionary and the network shown in Fig. 12.9 would be generated.




Fig. 12.9 Word Internal Triphone Expansion of Bit-But Network



If however the dictionary contained just the simple monophone pronunciations as in the first
case above, but the HMM set contained just triphones, that is

sil-b+i t-b+i b-i+t i-t+sil i-t+b
sil-b+u t-b+u b-u+t u-t+sil u-t+b sil

then HNet would perform full cross-word expansion and generate the network shown in Fig. 12.10.
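
The choice between the three kinds of network in these examples follows the decision rules given under stage 2. A minimal Python sketch of that decision is shown below. It is not HNet's algorithm: `choose_expansion` and `word_internal_names` are hypothetical names, the keyword arguments merely mirror the configuration parameters, and since the text does not spell out how ALLOWXWRDEXP interacts with the default (unforced) path, the sketch consults it only when expansion is forced.

```python
def word_internal_names(pron):
    """Model names for one pronunciation using word-internal contexts only."""
    names = []
    for k, p in enumerate(pron):
        name = p
        if k > 0:
            name = pron[k - 1] + "-" + name
        if k < len(pron) - 1:
            name = name + "+" + pron[k + 1]
        names.append(name)
    return names


def choose_expansion(prons, hmm_names, allow_cxt_exp=True,
                     allow_xwrd_exp=False, force_cxt_exp=False):
    """Return 'none', 'word-internal' or 'cross-word' (illustrative only)."""
    hmm_names = set(hmm_names)
    if not allow_cxt_exp:
        return "none"                    # phone names are used as model names
    if force_cxt_exp:
        return "cross-word" if allow_xwrd_exp else "word-internal"
    if all(p in hmm_names for pron in prons for p in pron):
        return "none"                    # closed dictionary: simplest network
    if all(n in hmm_names for pron in prons for n in word_internal_names(pron)):
        return "word-internal"
    return "cross-word"


mono_dict = [["b", "i", "t"], ["b", "u", "t"], ["sil"], ["sil"]]
triphones = ("sil-b+i t-b+i b-i+t i-t+sil i-t+b "
             "sil-b+u t-b+u b-u+t u-t+sil u-t+b sil").split()
print(choose_expansion(mono_dict, "b i t u sil".split()))
print(choose_expansion(mono_dict, triphones))
```

With the monophone HMM set the dictionary is closed and no expansion is chosen; with the triphone-only set neither the monophone names nor the word-internal names are all present, so the sketch falls through to cross-word expansion.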




Fig. 12.10 Cross-Word Triphone Expansion of Bit-But Network

Now suppose that, still using the simple monophone pronunciations, the HMM set contained all
monophones, biphones and triphones. In this case, the default would be to generate the monophone
network of Fig. 12.8. If FORCECXTEXP is true but ALLOWXWRDEXP is set false then the word-internal
network of Fig. 12.9 would be generated. Finally, if both FORCECXTEXP and ALLOWXWRDEXP are set
true then the cross-word network of Fig. 12.10 would be generated.

Although the recognition facilities of HTK are aimed primarily at sub-word based connected word
recognition, it can nevertheless support a variety of other types of recognition system.

To build a phoneme recogniser, a word-level network is defined using an SLF file in the usual
way except that each "word" in the network represents a single phone. The structure of the network
will typically be a loop in which all phones loop back to each other.

The dictionary then contains an entry for each "word" such that the word and the pronunciation
are the same, for example, the dictionary might contain

ih ih
eh eh
ah ah
... etc

Phoneme recognisers often use biphones to provide some measure of context-dependency.
Provided that the HMM set contains all the necessary biphones, then HNet will expand a simple phone
loop into a context-sensitive biphone loop simply by setting the configuration variable FORCELEFTBI
or FORCERIGHTBI to true, as appropriate.

Whole word recognisers can be set up in a similar way. The word network is designed using
the same considerations as for a sub-word based system but the dictionary gives the name of the
whole-word HMM in place of each word pronunciation.

Finally, word spotting systems can be defined by placing each keyword in a word network in
parallel with the appropriate filler models. The keywords can be whole-word models or subword
based. Note that in this case word transition penalties placed on the transitions can be used to
gain fine control over the false alarm rate.




Chapter 13 

Decoding with HVite



The previous chapter has described how to construct a recognition network specifying what is
allowed to be spoken and how each word is pronounced. Given such a network, its associated set
of HMMs, and an unknown utterance, the probability of any path through the network can be
computed. The task of a decoder is to find those paths which are the most likely.

As mentioned previously, decoding in HTK is performed by a library module called HRec.
HRec uses the token passing paradigm to find the best path and, optionally, multiple alternative
paths. In the latter case, it generates a lattice containing the multiple hypotheses which can if
required be converted to an N-best list. To drive HRec from the command line, HTK provides a
tool called HVite. As well as providing basic recognition, HVite can perform forced alignments,
lattice rescoring and recognise direct audio input.

To assist in evaluating the performance of a recogniser using a test database and a set of reference
transcriptions, HTK also provides a tool called HResults to compute word accuracy and various
related statistics. The principles and use of these recognition facilities are described in this chapter.

13.1 Decoder Operation 

As described in Chapter 12 and illustrated by Fig. 12.1, decoding in HTK is controlled by a
recognition network compiled from a word-level network, a dictionary and a set of HMMs. The
recognition network consists of a set of nodes connected by arcs. Each node is either a HMM model
instance or a word-end. Each model node is itself a network consisting of states connected by arcs.
Thus, once fully compiled, a recognition network ultimately consists of HMM states connected by
transitions. However, it can be viewed at three different levels: word, model and state. Fig. 13.1
illustrates this hierarchy.

Fig. 13.1 Recognition Network Levels

For an unknown input utterance with T frames, every path from the start node to the exit node
of the network which passes through exactly T emitting HMM states is a potential recognition
hypothesis. Each of these paths has a log probability which is computed by summing the log
probability of each individual transition in the path and the log probability of each emitting state
generating the corresponding observation. Within-HMM transitions are determined from the HMM
parameters, between-model transitions are constant and word-end transitions are determined by the
language model likelihoods attached to the word level networks.

The job of the decoder is to find those paths through the network which have the highest log
probability. These paths are found using a Token Passing algorithm. A token represents a partial
path through the network extending from time 0 through to time t. At time 0, a token is placed in
every possible start node.

Each time step, tokens are propagated along connecting transitions stopping whenever they
reach an emitting HMM state. When there are multiple exits from a node, the token is copied so
that all possible paths are explored in parallel. As the token passes across transitions and through
nodes, its log probability is incremented by the corresponding transition and emission probabilities.
A network node can hold at most N tokens. Hence, at the end of each time step, all but the N
best tokens in any node are discarded.
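
The propagation step described above can be sketched in Python for the simplest case of one token per node (N = 1). This is a toy illustration, not HRec: the network is a flat set of emitting states, the start node is a single non-emitting state, and the names `token_passing`, `trans` and `emit` as well as all the numbers are assumptions made for the example.

```python
def token_passing(trans, emit, start, final):
    """Find the best path through a small state network.

    trans: dict mapping (from_state, to_state) to a log transition probability
    emit:  list with one dict per frame, mapping state to log emission probability
    start: non-emitting start states; final: states in which a path may end
    A token is a pair (log probability, list of states visited).
    """
    tokens = {s: (0.0, []) for s in start}           # time 0: one token per start node
    for frame in emit:
        new_tokens = {}
        for (i, j), log_a in trans.items():
            if i not in tokens or j not in frame:
                continue
            logp, history = tokens[i]
            candidate = (logp + log_a + frame[j], history + [j])   # copy and extend
            # N = 1: keep only the best token arriving in each state
            if j not in new_tokens or candidate[0] > new_tokens[j][0]:
                new_tokens[j] = candidate
        tokens = new_tokens
    survivors = [tokens[s] for s in final if s in tokens]
    return max(survivors, key=lambda t: t[0], default=None)


trans = {("S", "A"): 0.0, ("A", "A"): -1.0, ("A", "B"): -1.0, ("B", "B"): -1.0}
emit = [{"A": -1.0, "B": -5.0}, {"A": -4.0, "B": -1.0}, {"A": -5.0, "B": -1.0}]
print(token_passing(trans, emit, start=["S"], final=["B"]))
```

Here the history records every state visited; recording less (for example only word-end nodes) would reduce the storage carried by each token.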

As each token passes through the network it must maintain a history recording its route. The
amount of detail in this history depends on the required recognition output. Normally, only word
sequences are wanted and hence, only transitions out of word-end nodes need be recorded. However,
for some purposes, it is useful to know the actual model sequence and the time of each model to
model transition. Sometimes a description of each path down to the state level is required. All of
this information, whatever level of detail is required, can conveniently be represented using a lattice
structure.

Of course, the number of tokens allowed per node and the amount of history information
requested will have a significant impact on the time and memory needed to compute the lattices. The
most efficient configuration is N = 1 combined with just word level history information and this is
sufficient for most purposes.

A large network will have many nodes and one way to make a significant reduction in the
computation needed is to only propagate tokens which have some chance of being amongst the 
eventual winners. This process is called pruning. It is implemented at each time step by keeping a 
record of the best token overall and de-activating all tokens whose log probabilities fall more than 
a beam-width below the best. For efficiency reasons, it is best to implement primary pruning at the 
model rather than the state level. Thus, models are deactivated when they have no tokens in any 
state within the beam and they are reactivated whenever active tokens are propagated into them. 
State-level pruning is also implemented by replacing any token by a null (zero probability) token if 
it falls outside of the beam. If the pruning beam-width is set too small then the most likely path 
might be pruned before its token reaches the end of the utterance. This results in a search error. 
Setting the beam-width is thus a compromise between speed and avoiding search errors.

When using word loops with bigram probabilities, tokens emitted from word-end nodes will have
a language model probability added to them before entering the following word. Since the range of
language model probabilities is relatively small, a narrower beam can be applied to word-end nodes
without incurring additional search errors. This beam is calculated relative to the best word-end
token and it is called a word-end beam. In the case of a recognition network with an arbitrary
topology, word-end pruning may still be beneficial but this can only be justified empirically.

Finally, a third type of pruning control is provided. An upper-bound on the allowed use of
compute resource can be applied by setting an upper-limit on the number of models in the network
which can be active simultaneously. When this limit is reached, the pruning beam-width is reduced
in order to prevent it being exceeded.
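
The basic beam test used in pruning can be illustrated with a short Python sketch. It shows only the comparison against the best token; model-level activation, the separate word-end beam and the maximum model limit are not modelled, and the function name `beam_prune` and the token scores are invented for the example.

```python
def beam_prune(tokens, beam_width):
    """Keep only tokens whose log probability lies within beam_width of the best.

    tokens maps a node name to the log probability of the token it holds.
    """
    best = max(tokens.values())
    threshold = best - beam_width
    return {node: logp for node, logp in tokens.items() if logp >= threshold}


tokens = {"a": -10.0, "b": -50.0, "c": -300.0}
print(beam_prune(tokens, 250.0))   # a wide beam keeps a and b
print(beam_prune(tokens, 20.0))    # a narrow beam keeps only a
```

With the narrow beam, the token in node b is lost even though its path might have become the best one later in the utterance, which is exactly the search error that a beam set too small can cause.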

13.2 Decoder Organisation

The decoding process itself is performed by a set of core functions provided within the library
module HRec. The process of recognising a sequence of utterances is illustrated in Fig. 13.2.

The first stage is to create a recogniser-instance. This is a data structure containing the compiled
recognition network and storage for storing tokens. The point of encapsulating all of the information
and storage needed for recognition into a single object is that HRec is re-entrant and can support
multiple recognisers simultaneously. Thus, although this facility is not utilised in the supplied
recogniser HVite, it does provide applications developers with the capability to have multiple
recognisers running with different networks.

Once a recogniser has been created, each unknown input is processed by first executing a start
recogniser call, and then processing each observation one-by-one. When all input observations have
been processed, recognition is completed by generating a lattice. This can be saved to disk as a
standard lattice format (SLF) file or converted to a transcription.

The above decoder organisation is extremely flexible and this is demonstrated by the HTK tool
HVite which is a simple shell program designed to allow HRec to be driven from the command
line.

Fig. 13.2 Recognition Processing

Firstly, input control in the form of a recognition network allows three distinct modes of operation



Recognition 

This is the conventional case in which the recognition network is compiled from a task level
word network.



Forced Alignment 

In this case, the recognition network is constructed from a word level transcription (i.e. orthog- 
raphy) and a dictionary. The compiled network may include optional silences between words 
and pronunciation variants. Forced alignment is often useful during training to automatically 
derive phone level transcriptions. It can also be used in automatic annotation systems. 

Lattice-based Rescoring 

In this case, the input network is compiled from a lattice generated during an earlier
recognition run. This mode of operation can be extremely useful for recogniser development since
rescoring can be an order of magnitude faster than normal recognition. The required lattices
are usually generated by a basic recogniser running with multiple tokens, the idea being to
generate a lattice containing both the correct transcription plus a representative number of
confusions. Rescoring can then be used to quickly evaluate the performance of more advanced
recognisers and the effectiveness of new recognition techniques.

The second source of flexibility lies in the provision of multiple tokens and recognition output in 
the form of a lattice. In addition to providing a mechanism for rescoring, lattice output can be used 
as a source of multiple hypotheses either for further recognition processing or input to a natural 
language processor. Where convenient, lattice output can easily be converted into N-best lists. 

Finally, since HRec is explicitly driven step-by-step at the observation level, it allows fine control
over the recognition process and a variety of traceback and on-the-fly output possibilities.

For application developers, HRec and the HTK library modules on which it depends can be
linked directly into applications. It will also be available in the form of an industry standard API.
However, as mentioned earlier the HTK toolkit also supplies a tool called HVite which is a shell
program designed to allow HRec to be driven from the command line. The remainder of this
chapter will therefore explain the various facilities provided for recognition from the perspective of
HVite.

13.3 Recognition using Test Databases

When building a speech recognition system or investigating speech recognition algorithms,
performance must be monitored by testing on databases of test utterances for which reference
transcriptions are available. To use HVite for this purpose it is invoked with a command line of the
form

HVite -w wdnet dict hmmlist testf1 testf2 ...

where wdnet is an SLF file containing the word level network, dict is the pronouncing dictionary
and hmmlist contains a list of the HMMs to use. The effect of this command is that HVite will
use HNet to compile the word level network and then use HRec to recognise each test file. The
parameter kind of these test files must match exactly with that used to train the HMMs. For
evaluation purposes, test files are normally stored in parameterised form but only the basic static
coefficients are saved on disk. For example, delta parameters are normally computed during loading.
As explained in Chapter 5, HTK can perform a range of parameter conversions on loading and
these are controlled by configuration variables. Thus, when using HVite, it is normal to include
a configuration file via the -C option in which the required target parameter kind is specified.
Section 13.6 below on processing direct audio input explains the use of configuration files in more
detail.

In the simple default form of invocation given above, HVite would expect to find each HMM
definition in a separate file in the current directory and each output transcription would be written
to a separate file in the current directory. Also, of course, there will typically be a large number of
test files.

In practice, it is much more convenient to store HMMs in master macro files (MMFs), store
transcriptions in master label files (MLFs) and list data files in a script file. Thus, a more common
form of the above invocation would be

HVite -T 1 -S test.scp -H hmmset -i results -w wdnet dict hmmlist



where the file test.scp contains the list of test file names, hmmset is an MMF containing the HMM
definitions [1], and results is the MLF for storing the recognition output.

It is usually a good idea to enable basic progress reporting by setting the trace option
as shown. This will cause the recognised word string to be printed after processing each file. For
example, in a digit recognition task the trace output might look like

File: testf1.mfc
SIL ONE NINE FOUR SIL
[178 frames] -96.1404 [Ac=-16931.8 LM=-181.2] (Act=75.0)



[1] Large HMM sets will often be distributed across a number of MMF files. In this case, the -H
option will be repeated for each file.

where the information listed after the recognised string is the total number of frames in the
utterance, the average log probability per frame, the total acoustic likelihood, the total language model
likelihood and the average number of active models.
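
The figures in the example trace line are related by simple arithmetic: the average log probability per frame equals the sum of the acoustic and language model totals divided by the number of frames. The Python sketch below checks this for the line shown above. The regular expression is an assumption about the layout of this one example line, not a specification of HVite's trace format.

```python
import re

line = "[178 frames] -96.1404 [Ac=-16931.8 LM=-181.2] (Act=75.0)"

# Pull the five numbers out of the example line.
match = re.match(r"\[(\d+) frames\] (\S+) \[Ac=(\S+) LM=(\S+)\] \(Act=(\S+)\)", line)
frames = int(match.group(1))
average, acoustic, lm, active = (float(match.group(k)) for k in range(2, 6))

# Average log probability per frame recomputed from the two totals.
recomputed = (acoustic + lm) / frames
print(frames, average, round(recomputed, 4))
```

Checking the reported average against the totals in this way is a quick sanity test when scores are post-processed by other scripts.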

The corresponding transcription written to the output MLF will contain an entry of the
form

"testf1.rec"
0 6200000 SIL -6067.333008
6200000 9200000 ONE -3032.359131
9200000 12300000 NINE -3020.820312
12300000 17600000 FOUR -4690.033203
17600000 17800000 SIL -302.439148
.

This shows the start and end time of each word and the total log probability. The fields output by
HVite can be controlled using the -o option. For example, the option -o ST would suppress the scores
and the times to give

"testf1.rec"
SIL
ONE
NINE
FOUR
SIL



In order to use HVite effectively and efficiently, it is important to set appropriate values for its
pruning thresholds and the language model scaling parameters. The main pruning beam is set by
the -t option. Some experimentation will be necessary to determine appropriate levels but around
250.0 is usually a reasonable starting point. Word-end pruning (-v) and the maximum model limit
(-u) can also be set if required, but these are not mandatory and their effectiveness will depend
greatly on the task.

The relative levels of insertion and deletion errors can be controlled by scaling the language
model likelihoods using the -s option and adding a fixed penalty using the -p option. For example,
setting -s 10.0 -p -20.0 would mean that every language model log probability x would be
converted to 10x - 20 before being added to the tokens emitted from the corresponding word-end
node. As an extreme example, setting -p 100.0 caused the digit recogniser above to output

SIL OH OH ONE OH OH OH NINE FOUR OH OH OH OH SIL

where adding 100 to each word-end transition has resulted in a large number of insertion errors. The
word inserted is "oh" primarily because it is the shortest in the vocabulary. Another problem which
may occur during recognition is the inability to arrive at the final node of the recognition network
after processing the whole utterance. The user is made aware of the problem by the message "No
tokens survived to final node of network". The inability to match the data against the recognition
network is usually caused by poorly trained acoustic models and/or very tight pruning beam-widths.
In such cases, partial recognition results can still be obtained by setting the HRec configuration
variable FORCEOUT true. The results will be based on the most likely partial hypothesis found in
the network.
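
The conversion applied to each language model log probability is a simple linear map, sketched below in Python. The function name `scaled_lm` is hypothetical, and the defaults of scale 1.0 and penalty 0.0 are assumptions chosen so that the unmodified score is returned when neither option is given.

```python
def scaled_lm(log_prob, scale=1.0, penalty=0.0):
    """Return the score added to a token at a word-end node.

    scale corresponds to the value given with -s and penalty to the value
    given with -p, following the 10x - 20 example in the text.
    """
    return scale * log_prob + penalty


print(scaled_lm(-3.0, scale=10.0, penalty=-20.0))   # 10 * (-3) - 20
print(scaled_lm(-3.0, penalty=100.0))               # large positive value added per word
```

Because the penalty is added once per word-end transition, a large positive value rewards hypotheses with many short words, which is the behaviour seen in the extreme example above.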

13.4 Evaluating Recognition Results 

Once the test data has been processed by the recogniser, the next step is to analyse the results.
The tool HResults is provided for this purpose. HResults compares the transcriptions output
by HVite with the original reference transcriptions and then outputs various statistics. HResults
matches each of the recognised and reference label sequences by performing an optimal string match
using dynamic programming. Except when scoring word-spotter output as described later, it does
not take any notice of any boundary timing information stored in the files being compared. The
optimal string match works by calculating a score for the match with respect to the reference such
that identical labels match with score 0, a label insertion carries a score of 7, a deletion carries a
score of 7 and a substitution carries a score of 10 [2]. The optimal string match is the label alignment
which has the lowest possible score.

Once the optimal alignment has been found, the number of substitution errors (S), deletion
errors (D) and insertion errors (I) can be calculated. The percentage correct is then

\[ \text{Percent Correct} = \frac{N - D - S}{N} \times 100\% \tag{13.1} \]

where N is the total number of labels in the reference transcriptions. Notice that this measure
ignores insertion errors. For many purposes, the percentage accuracy defined as

\[ \text{Percent Accuracy} = \frac{N - D - S - I}{N} \times 100\% \tag{13.2} \]

is a more representative figure of recogniser performance.
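
The string match and the two measures can be sketched in Python as follows. This is a minimal dynamic programming alignment using the weights quoted in the text (insertion 7, deletion 7, substitution 10), written for illustration only. It is not the HResults implementation, the function names are hypothetical, and the way ties between equally scored alignments are broken is an arbitrary choice of this sketch.

```python
def align(ref, rec, ins=7, dele=7, sub=10):
    """Return (score, hits, subs, dels, inss) for the best alignment of rec to ref."""
    rows, cols = len(ref) + 1, len(rec) + 1
    grid = [[None] * cols for _ in range(rows)]
    grid[0][0] = (0, 0, 0, 0, 0)
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:                      # match or substitution
                c, h, s, d, n = grid[i - 1][j - 1]
                if ref[i - 1] == rec[j - 1]:
                    candidates.append((c, h + 1, s, d, n))
                else:
                    candidates.append((c + sub, h, s + 1, d, n))
            if i > 0:                                # reference label missing: deletion
                c, h, s, d, n = grid[i - 1][j]
                candidates.append((c + dele, h, s, d + 1, n))
            if j > 0:                                # extra recognised label: insertion
                c, h, s, d, n = grid[i][j - 1]
                candidates.append((c + ins, h, s, d, n + 1))
            grid[i][j] = min(candidates, key=lambda t: t[0])
    return grid[-1][-1]


def percent_correct(n, d, s):
    return 100.0 * (n - d - s) / n


def percent_accuracy(n, d, s, i):
    return 100.0 * (n - d - s - i) / n


ref = "FOUR SEVEN NINE THREE".split()
rec = "FOUR OH SEVEN FIVE THREE".split()
score, hits, subs, dels, inss = align(ref, rec)
print(score, hits, subs, dels, inss)
print(percent_correct(len(ref), dels, subs), percent_accuracy(len(ref), dels, subs, inss))
```

For this pair of label sequences the best alignment has one insertion and one substitution, so the percentage correct is 75% while the percentage accuracy is only 50%, which shows why accuracy is the stricter measure.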

HResults outputs both of the above measures. As with all HTK tools it can process individual
label files and files stored in MLFs. Here the examples will assume that both reference and test
transcriptions are stored in MLFs.

As an example of use, suppose that the MLF results contains recogniser output transcriptions,
refs contains the corresponding reference transcriptions and wlist contains a list of all labels
appearing in these files. Then typing the command

HResults -I refs wlist results

would generate something like the following

====================== HTK Results Analysis ======================
  Date: Sat Sep  2 14:14:22 1995
  Ref : refs
  Rec : results
------------------------ Overall Results -------------------------
SENT: %Correct=98.50 [H=197, S=3, N=200]
WORD: %Corr=99.77, Acc=99.65 [H=853, D=1, S=1, I=1, N=855]



The first part shows the date and the names of the files being used. The line labelled SENT shows
the total number of complete sentences which were recognised correctly. The second line labelled
WORD gives the recognition statistics for the individual words [3].

It is often useful to visually inspect the recognition errors. Setting the -t option causes aligned
test and reference transcriptions to be output for all sentences containing errors. For example, a
typical output might be

Aligned transcription: testf9.lab vs testf9.rec
 LAB: FOUR    SEVEN NINE THREE
 REC: FOUR OH SEVEN FIVE THREE

here an "oh" has been inserted by the recogniser and "nine" has been recognised as "five".

If preferred, results output can be formatted in an identical manner to NIST scoring software
by setting the -h option. For example, the results given above would appear as follows in NIST
format

| HTK Results Analysis at Sat Sep  2 14:42:06 1995       |
| Ref: refs                                              |
| Rec: results                                           |
|         | # Snt |  Corr    Sub    Del    Ins    Err  S. Err |
| Sum/Avg |  200  | 99.77   0.12   0.12   0.12   0.35   1.50  |

[2] The default behaviour of HResults is slightly different to the widely used US NIST scoring software which uses
weights of 3, 3 and 4 and a slightly different alignment algorithm. Identical behaviour to NIST can be obtained by
setting the -n option.

[3] All the examples here will assume that each label corresponds to a word but in general the labels could stand
for any recognition unit such as phones, syllables, etc. HResults does not care what the labels mean but for human
consumption, the labels SENT and WORD can be changed using the -a and -b options.

When computing recognition results it is sometimes inappropriate to distinguish certain labels.
For example, to assess a digit recogniser used for voice dialing it might be required to treat the
alternative vocabulary items "oh" and "zero" as being equivalent. This can be done by making
them equivalent using the -e option, that is

HResults -e ZERO OH 



If a label is equated to the special label ???, then it is ignored. Hence, for example, if the recognition
output had silence marked by SIL, then setting the option -e ??? SIL would cause all the SIL labels
to be ignored.

HResults contains a number of other options. Recognition statistics can be generated for each
file individually by setting the -f option and a confusion matrix can be generated by setting the -p
option. When comparing phone recognition results, HResults will strip any triphone contexts by
setting the -s option. HResults can also process N-best recognition output. Setting the option
-d N causes HResults to search the first N alternatives of each test output file to find the most
accurate match with the reference labels.

When analysing the performance of a speaker independent recogniser it is often useful to obtain
accuracy figures on a per speaker basis. This can be done using the option -k mask where mask is a
pattern used to extract the speaker identifier from the test label file name. The pattern consists of a
string of characters which can include the pattern matching metacharacters * and ? to match zero
or more characters and a single character, respectively. The pattern should also contain a string of
one or more % characters which are used as a mask to identify the speaker identifier.

For example, suppose that the test filenames had the following structure

DIGITS_spkr_nnnn.rec

where spkr is a 4 character speaker id and nnnn is a 4 digit utterance id. Then executing HResults
by

HResults -h -k '*_%%%%_????.*'

would give output of the form

| HTK Results Analysis at Sat Sep  2 15:05 1995          |
| Ref: refs                                              |
| Rec: results                                           |
| SPKR    | # Snt |  Corr    Sub    Del    Ins    Err  S. Err |
| dgo1    |   20  | 100.00   0.00   0.00   0.00   0.00   0.00 |
| pcw1    |   20  |  97.22   1.39   1.39   0.00   2.78  10.00 |
| Sum/Avg |  200  |  99.77   0.12   0.12   0.12   0.35   1.50 |
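
The way a mask picks out the speaker identifier can be sketched in Python by translating the mask into a regular expression. This is an illustration of the matching rules described in the text, not the code used by HResults. The function name `speaker_id` and the file name `DIGITS_dgo1_0001.rec` are invented for the example.

```python
import re


def speaker_id(mask, name):
    """Return the characters of name that line up with % in mask, or None.

    '*' matches zero or more characters, '?' matches exactly one character
    and each '%' matches one character that becomes part of the speaker id.
    """
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        elif ch == "%":
            parts.append("(.)")          # capture one speaker-id character
        else:
            parts.append(re.escape(ch))  # literal character
    match = re.fullmatch("".join(parts), name)
    return "".join(match.groups()) if match else None


print(speaker_id("*_%%%%_????.*", "DIGITS_dgo1_0001.rec"))
```

Grouping results by the returned identifier then gives one row per speaker, as in the table above.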



In addition to string matching, HResults can also analyse the results of a recogniser configured
for word-spotting. In this case, there is no DP alignment. Instead, each recogniser label w is
compared with the reference transcriptions. If the start and end times of w lie either side of the
mid-point of an identical label in the reference, then that recogniser label represents a hit, otherwise
it is a false-alarm (FA).

The recogniser output must include the log likelihood scores as well as the word boundary
information. These scores are used to compute the Figure of Merit (FOM) defined by NIST which
is an upper-bound estimate on word spotting accuracy averaged over 1 to 10 false alarms per hour.
The FOM is calculated as follows where it is assumed that the total duration of the test speech is
T hours. For each word, all of the spots are ranked in score order. The percentage of true hits p_i
found before the i'th false alarm is then calculated for i = 1 ... N + 1 where N is the first integer
≥ 10T - 0.5. The figure of merit is then defined as

\[ \text{FOM} = \frac{1}{10T} (p_1 + p_2 + \cdots + p_N + a\,p_{N+1}) \tag{13.3} \]

where a = 10T - N is a factor that interpolates to 10 false alarms per hour.

Word spotting analysis is enabled by setting the -w option and the resulting output has the 
form 

Figures of Merit 

Keyword    #Hits    #FAs    #Actual      FOM 
BADGE         92      83        102    73.56 
CAMERA        20       2         22    89.86 
WINDOW        84       8         92    86.98 
VIDEO         72       6         72    99.81 
Overall      268      99        188    87.55 



If required the standard time unit of 1 hour as used in the above definition of FOM can be changed 
using the -u option. 



13.5 Generating Forced Alignments 

[Figure 13.3: forced alignment, file.mfc -> HVite -> file.rec] 

HVite can be made to compute forced alignments by not specifying a network with the -w 
option but by specifying the -a option instead. In this mode, HVite computes a new network 
for each input utterance using the word level transcriptions and a dictionary. By default, the 
output transcription will just contain the words and their boundaries. One of the main uses of 
forced alignment, however, is to determine the actual pronunciations used in the utterances used 
to train the HMM system¹ and in this case, the -m option can be used to generate model level output 
transcriptions. This type of forced alignment is usually part of a bootstrap process: initially, 
models are trained on the basis of one fixed pronunciation per word, then HVite is used in forced 
alignment mode to select the best matching pronunciations. The new phone level transcriptions 
can then be used to retrain the HMMs. Since training data may have leading and trailing silence, 
it is usually necessary to insert a silence model at the start and end of the recognition network. 
The -b option can be used to do this. 

As an illustration, executing 

HVite -a -b sil -m -o SWT -I words.mlf \ 
      -H hmmset dict hmmlist file.mfc 

would result in the following sequence of events (see Fig. 13.3). The input file name file.mfc 
would have its extension replaced by lab and then a label file of this name would be searched 
for. In this case, the MLF file words.mlf has been loaded. Assuming that this file contains a 
word level transcription called file.lab, this transcription along with the dictionary dict will be 
used to construct a network equivalent to file.lab but with alternative pronunciations included 
in parallel. Since the -b option has been set, the specified sil model will be inserted at the start 
and end of the network. The decoder then finds the best matching path through the network and 
constructs a lattice which includes model alignment information. Finally, the lattice is converted 
to a transcription and output to the label file file.rec. As for testing on a database, alignments 
will normally be computed on a large number of input files so in practice the input files would be 
listed in a .scp file and the output transcriptions would be written to an MLF using the -i option. 

¹ The HLEd EX command can be used to compute phone level transcriptions when there is only one possible 
phone transcription per word. 

When the -m option is used, the transcriptions output by HVite would by default contain both 
the model level and word level transcriptions. For example, a typical fragment of the output 
might be 

7500000 8700000 f -1081.604736 FOUR 30.000000 
8700000 9800000 ao -903.821350 
9800000 10400000 r -665.931641 
10400000 10400000 sp -0.103585 
10400000 11700000 s -1266.470093 SEVEN 22.860001 
11700000 12500000 eh -765.568237 
12500000 13000000 v -476.323334 
13000000 14400000 n -1285.369629 
14400000 14400000 sp -0.103585 

Here the score alongside each model name is the acoustic score for that segment. The score alongside 
the word is just the language model score. 

Although the above information can be useful for some purposes, for example in bootstrap 
training, only the model names are required. The formatting option -o SWT in the above suppresses 
all output except the model names. 
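As a rough illustration of how such output might be post-processed, the following Python sketch parses label lines of the form shown here and groups the model names under the word that starts each group. The `Segment` type and both function names are hypothetical helpers written for this example; the only facts taken from the text are the field order and the use of 100 ns time units.

```python
from typing import NamedTuple, Optional

HTK_TIME_UNIT = 1e-7  # label times are in units of 100 ns

class Segment(NamedTuple):
    start_s: float
    end_s: float
    model: str
    acoustic: float
    word: Optional[str]        # present only on the first model of a word
    lm_score: Optional[float]  # present only alongside the word

def parse_label_line(line):
    """Parse 'start end model score [word lmscore]' into a Segment."""
    f = line.split()
    word = f[4] if len(f) > 4 else None
    lm = float(f[5]) if len(f) > 5 else None
    return Segment(int(f[0]) * HTK_TIME_UNIT, int(f[1]) * HTK_TIME_UNIT,
                   f[2], float(f[3]), word, lm)

def words_with_models(lines):
    """Group model-level segments under the word that introduces them."""
    words = []
    for seg in map(parse_label_line, lines):
        if seg.word is not None:
            words.append((seg.word, [seg.model]))
        elif words:
            words[-1][1].append(seg.model)
    return words

fragment = [
    "7500000 8700000 f -1081.604736 FOUR 30.000000",
    "8700000 9800000 ao -903.821350",
    "9800000 10400000 r -665.931641",
    "10400000 10400000 sp -0.103585",
]
print(words_with_models(fragment))
```

Keeping the word on the first model only mirrors the layout of the fragment, so the grouping needs no look-ahead.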



13.6 Recognition using Direct Audio Input 

In all of the preceding discussion, it has been assumed that input was from speech files stored 
on disk. These files would normally have been stored in parameterised form so that little or no 
conversion of the source speech data was required. When HVite is invoked with no files listed on 
the command line, it assumes that input is to be taken directly from the audio input. In this case, 
configuration variables must be used to specify firstly, how the speech waveform is to be captured 
and secondly, how the captured waveform is to be converted to parameterised form. 

Dealing with waveform capture first, as described in section 5.12, HTK provides two main forms 
of control over speech capture: signals/keypress and an automatic speech/silence detector. To use 
the speech/silence detector alone, the configuration file would contain the following 

# Waveform capture 
SOURCERATE=625.0 
SOURCEKIND=HAUDIO 
SOURCEFORMAT=HTK 
USESILDET=T 
MEASURESIL=F 
OUTSILWARN=T 
ENORMALISE=F 

where the source sampling rate is being set to 16kHz. Notice that the SOURCEKIND must be 
set to HAUDIO and the SOURCEFORMAT must be set to HTK. Setting the Boolean variable USESILDET 
causes the speech/silence detector to be used, and the MEASURESIL and OUTSILWARN variables result in 
a measurement being taken of the background silence level prior to capturing the first utterance. 
To make sure that each input utterance is being captured properly, the HVite option -g can be set 
to cause the captured wave to be output after each recognition attempt. Note that for a live audio 
input system, the configuration variable ENORMALISE should be explicitly set to FALSE both when 
training models and when performing recognition. Energy normalisation cannot be used with live 
audio input, and the default setting for this variable is TRUE. 

As an alternative to using the speech/silence detector, a signal can be used to start and stop 
recording. For example, 

# Waveform capture 
SOURCERATE=625.0 
SOURCEKIND=HAUDIO 
SOURCEFORMAT=HTK 
AUDIOSIG=2 

would result in the Unix interrupt signal (usually the Control-C key) being used as a start and stop 
control.² Key-press control of the audio input can be obtained by setting AUDIOSIG to a negative 
number. 

Both of the above can be used together. In this case, audio capture is disabled until the specified 
signal is received. From then on control is in the hands of the speech/silence detector. 

The captured waveform must be converted to the required target parameter kind. Thus, the 
configuration file must define all of the parameters needed to control the conversion of the waveform 
to the required target kind. This process is described in detail in Chapter 5. As an example, the 
following parameters would allow conversion to Mel-frequency cepstral coefficients with delta and 
acceleration parameters. 

# Waveform to MFCC parameters 
TARGETKIND=MFCC_0_D_A 
TARGETRATE=100000.0 
WINDOWSIZE=250000.0 
ZMEANSOURCE=T 
USEHAMMING = T 
PREEMCOEF = 0.97 
USEPOWER = T 
NUMCHANS = 26 
CEPLIFTER = 22 
NUMCEPS = 12 

Many of these variable settings are the default settings and could be omitted; they are included 
explicitly here as a reminder of the main configuration options available. 
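The rate and size variables in these configuration fragments are expressed in units of 100 ns, which is why a SOURCERATE of 625 corresponds to 16kHz. The short Python sketch below simply carries out that arithmetic; the helper names are invented for the example and are not part of HTK.

```python
HTK_UNIT_SECONDS = 100e-9  # HTK expresses periods and durations in 100 ns units

def sample_rate_hz(source_rate):
    """Sampling frequency implied by a SOURCERATE sample period."""
    return 1.0 / (source_rate * HTK_UNIT_SECONDS)

def to_milliseconds(value):
    """Convert a TARGETRATE or WINDOWSIZE value to milliseconds."""
    return value * HTK_UNIT_SECONDS * 1000.0

def samples_per(duration, source_rate):
    """Number of waveform samples spanned by a duration given in HTK units."""
    return round(duration / source_rate)

print(sample_rate_hz(625.0))          # sampling frequency in Hz
print(to_milliseconds(100000.0))      # frame period in ms
print(to_milliseconds(250000.0))      # analysis window length in ms
print(samples_per(250000.0, 625.0))   # samples per analysis window
```

With the settings above this corresponds to a 25 ms window advanced every 10 ms, that is, 400-sample windows shifted by 160 samples.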

When HVite is executed in direct audio input mode, it issues a prompt prior to each input and 
it is normal to enable basic tracing so that the recognition results can be seen. A typical terminal 
output might be 

READY[1]> 
Please speak sentence - measuring level 
Level measurement completed 
DIAL ONE FOUR SEVEN 
== [258 frames] -97.8668 [Ac=-25031.3 LM=-218.4] (Act=22.3) 

READY[2]> 
CALL NINE TWO EIGHT 
== [233 frames] -97.0850 [Ac=-22402.5 LM=-218.4] (Act=21.8) 

etc 

If required, a transcription of each spoken input can be output to a label file or an MLF in the 
usual way by setting the -e option. However, to do this a file name must be synthesised. This is 
done by using a counter prefixed by the value of the HVite configuration variable RECOUTPREFIX 
and suffixed by the value of RECOUTSUFFIX. For example, with the settings 

RECOUTPREFIX = sjy 
RECOUTSUFFIX = .rec 

then the output transcriptions would be stored as sjy0001.rec, sjy0002.rec and so on. 
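The naming scheme amounts to simple string formatting. A minimal Python sketch follows; the function name is hypothetical and the four-digit zero padding is inferred from the example file names rather than from any stated rule.

```python
def rec_output_name(prefix, counter, suffix, width=4):
    """Build an output name as prefix + zero-padded counter + suffix.

    The default width of 4 is an assumption based on names such as sjy0001.rec.
    """
    return f"{prefix}{counter:0{width}d}{suffix}"

for i in (1, 2):
    print(rec_output_name("sjy", i, ".rec"))
```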

13.7 N-Best Lists and Lattices 

As noted in section 13.1, HVite can generate lattices and N-best outputs. To generate an N-best 
list, the -n option is used to specify the number of N-best tokens to store per state and the number 
of N-best hypotheses to generate. The result is that for each input utterance, a multiple alternative 
transcription is generated. For example, setting -n 4 20 with a digit recogniser would generate an 
output of the form 



"testf1.rec" 
FOUR 
SEVEN 
NINE 
OH 
/// 
FOUR 
SEVEN 
NINE 
OH 
OH 
/// 
etc 

² The underlying signal number must be given, HTK cannot interpret the standard Unix signal names such as 
SIGINT. 
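An N-best transcription of this kind can be split back into its alternative hypotheses with very little code. The Python sketch below assumes, as in the example output, that each hypothesis is terminated by a line containing only `///` and that a quoted file name line precedes the first hypothesis; the function name is invented for illustration.

```python
def parse_nbest(lines):
    """Split an N-best label listing into a list of word sequences."""
    hypotheses, current = [], []
    for raw in lines:
        token = raw.strip()
        if not token or token.startswith('"'):
            continue                    # skip blank lines and the quoted file name
        if token == "///":
            hypotheses.append(current)  # separator closes the current hypothesis
            current = []
        else:
            current.append(token)
    if current:                         # tolerate a missing final separator
        hypotheses.append(current)
    return hypotheses

listing = ['"testf1.rec"', "FOUR", "SEVEN", "NINE", "OH", "///",
           "FOUR", "SEVEN", "NINE", "OH", "OH", "///"]
for rank, words in enumerate(parse_nbest(listing), start=1):
    print(rank, " ".join(words))
```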



The lattices from which the N-best lists are generated can be output by setting the option -z 
ext. In this case, a lattice called testf.ext will be generated for each input test file testf.xxx. 
By default, these lattices will be stored in the same directory as the test files, but they can be 
redirected to another directory using the -l option. 

The lattices generated by HVite have the following general form 

VERSION=1.0 
UTTERANCE=testf1.mfc 
lmname=wdnet 
lmscale=20.00 wdpenalty=-30.00 

... 
... v=0 a=-3239.01 l=0.00 
... v=0 a=-3820.77 l=0.00 
... 
... v=0 a=-246.99 l=-1.20 

The first 5 lines comprise a header which records names of the files used to generate the lattice 
along with the settings of the language model scale and penalty factors. Each node in the lattice 
represents a point in time measured in seconds and each arc represents a word spanning the segment 
of the input starting at the time of its start node and ending at the time of its end node. For each 
such span, v gives the number of the pronunciation used, a gives the acoustic score and l gives the 
language model score. 

The language model scores in output lattices do not include the scale factors and penalties. 
These are removed so that the lattice can be used as a constraint network for subsequent recogniser 
testing. When using HVite normally, the word level network file is specified using the -w option. 
When the -w option is included but no file name is included, HVite constructs the name of a lattice 
file from the name of the test file and inputs that. Hence, a new recognition network is created for 
each input file and recognition is very fast. For example, this is an efficient way of experimentally 
determining optimum values for the language model scale and penalty factors. 
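Because the stored language model scores are unscaled, a path through a lattice can be rescored under different settings without rerunning the acoustic models. The Python sketch below illustrates the idea under a stated assumption: that the combined log score of an arc is the acoustic score plus the language model score multiplied by the scale factor, plus one word penalty per arc. Both helper names are hypothetical, and the parser only handles simple `key=value` fields of the kind shown in the lattice fragment.

```python
def parse_fields(line):
    """Turn a line of space-separated key=value fields into a dictionary."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

def arc_score(arc, lmscale, wdpenalty):
    """Combined log score of one arc (assumed form: a + lmscale * l + wdpenalty)."""
    return float(arc["a"]) + lmscale * float(arc["l"]) + wdpenalty

def path_score(arcs, lmscale, wdpenalty):
    """Total score of a path given as a list of parsed arcs."""
    return sum(arc_score(arc, lmscale, wdpenalty) for arc in arcs)

arc = parse_fields("v=0 a=-246.99 l=-1.20")
for scale in (10.0, 20.0):
    print(scale, arc_score(arc, scale, -30.0))
```

Sweeping the scale and penalty over a grid in this way is one inexpensive means of exploring settings before confirming them with a full recognition run.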
Chapter 14 

Fundamentals of language modelling 

The HTK language modelling tools are designed for constructing and testing statistical n-gram 
language models. This chapter introduces language modelling and provides an overview of the 
supplied tools. It is strongly recommended that you read this chapter and then work through the 
tutorial in the following chapter - this will provide you with everything you need to know to get 
started building language models. 



[Figure: Training text -> Gram files -> Vocabulary and class mapping, plus gram file sequencing -> 
n-gram LM; Test text + n-gram LM -> Perplexity] 

An n-gram is a sequence of n symbols (e.g. words, syntactic categories, etc.) and an n-gram 
language model (LM) is used to predict each symbol in the sequence given its n-1 predecessors. It 
is built on the assumption that the probability of a specific n-gram occurring in some unknown test 
text can be estimated from the frequency of its occurrence in some given training text. Thus, as 
illustrated by the picture above, n-gram construction is a three stage process. Firstly, the training 
text is scanned and its n-grams are counted and stored in a database of gram files. In the second 
stage some words may be mapped to an out of vocabulary class or other class mapping may be 
applied, and then in the final stage the counts in the resulting gram files are used to compute n-gram 
probabilities which are stored in the language model file. Lastly, the goodness of a language model 
can be estimated by using it to compute a measure called perplexity on a previously unseen test 
set. In general, the better a language model, the lower its test-set perplexity. 

Although the basic principle of an n-gram LM is very simple, in practice there are usually many 
more potential n-grams than can ever be collected in a training text in sufficient numbers to yield 
robust frequency estimates. Furthermore, for any real application such as speech recognition, the 
use of an essentially static and finite training text makes it difficult to generate a single LM which is 
well-matched to varying test material. For example, an LM trained on newspaper text would be a 
good predictor for dictating news reports but the same LM would be a poor predictor for personal 
letters or a spoken interface to a flight reservation system. A final difficulty is that the vocabulary 
of an n-gram LM is finite and fixed at construction time. Thus, if the LM is word-based, it can only 
predict words within its vocabulary and furthermore new words cannot be added without rebuilding 
the LM. 

The following four sections provide a thorough introduction to the theory behind n-gram models. 
It is well worth reading through this section because it will provide you with at least a basic 
understanding of what many of the tools and their parameters actually do - you can safely skip the 
equations if you choose because the text explains all the most important parts in plain English. The 
final section of this chapter then introduces the tools provided to implement the various aspects of 
n-gram language modelling that have been described. 

14.1 n-gram language models 

Language models estimate the probability of a word sequence, $P(w_1, w_2, \ldots, w_m)$ - that is, they 
evaluate $P(w_i)$ as defined in equation 1.3 in chapter 1.¹ 

The probability $P(w_1, w_2, \ldots, w_m)$ can be decomposed as a product of conditional probabilities: 

$$ P(w_1, w_2, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \tag{14.1} $$

14.1.1 Word n-gram models 

Equation 14.1 presents an opportunity for approximating $P(W)$ by limiting the context: 

$$ P(w_1, w_2, \ldots, w_m) \simeq \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) \tag{14.2} $$

for some $n \geq 1$. If language is assumed to be ergodic - that is, it has the property that the probability 
of any state can be estimated from a large enough history independent of the starting conditions² - 
then for sufficiently high n equation 14.2 is exact. Due to reasons of data sparsity, however, values 
of n in the range of 1 to 4 inclusive are typically used, and there are also practicalities of storage 
space for these estimates to consider. Models using contiguous but limited context in this way 
are usually referred to as n-gram language models, and the conditional context component of the 
probability ("$w_{i-n+1}, \ldots, w_{i-1}$" in equation 14.2) is referred to as the history. 

Estimates of probabilities in n-gram models are commonly based on maximum likelihood estimates - 
that is, by counting events in context on some given training text: 

$$ P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \frac{C(w_{i-n+1}, \ldots, w_i)}{C(w_{i-n+1}, \ldots, w_{i-1})} \tag{14.3} $$

where C(.) is the count of a given word sequence in the training text. Refinements to this maximum 
likelihood estimate are described later in this chapter. 
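The count ratio of equation 14.3 is straightforward to compute directly. The following Python sketch does so on a toy word list; it is an illustration only, with invented function names, and for the unigram case it takes the denominator to be the total number of words (an assumption, since the history is then empty).

```python
from collections import Counter

def ngram_counts(words, n):
    """Count every contiguous n-word sequence in the word list."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def mle_probability(words, ngram):
    """Maximum likelihood estimate C(history + word) / C(history)."""
    n = len(ngram)
    numerator = ngram_counts(words, n)[tuple(ngram)]
    if n == 1:
        denominator = len(words)  # empty history: normalise by the text length
    else:
        denominator = ngram_counts(words, n - 1)[tuple(ngram[:-1])]
    return numerator / denominator if denominator else 0.0

text = "the cat sat on the mat the cat ran".split()
print(mle_probability(text, ("the", "cat")))         # bigram estimate
print(mle_probability(text, ("the", "cat", "sat")))  # trigram estimate
print(mle_probability(text, ("cat", "the")))         # unseen bigram
```

The unseen bigram receives an estimate of exactly zero, which is the weakness that the discounting and smoothing methods described later are designed to address.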

The choice of n has a significant effect on the number of potential parameters that the model 
can have, which is maximally bounded by $|\mathbb{W}|^n$, where $\mathbb{W}$ is the set of words in the language model, 
also known as the vocabulary. A 4-gram model with a typically-sized 65,000 word vocabulary 
can therefore potentially have $65{,}000^4 \simeq 1.8 \times 10^{19}$ parameters. In practice, however, only a 
small subset of the possible parameter combinations represent likely word sequences, so the storage 
requirement is far less than this theoretical maximum - of the order of $10^{11}$ times less in fact.³ 
Even given this significant reduction in coverage and a very large training text⁴ there are still many 
plausible word sequences which will not be encountered in the training text, or will not be found a 
statistically significant number of times. It would not be sensible to assign all unseen sequences zero 
probability, so methods of coping with low and zero occurrence word tuples have been developed. 
This is discussed later in section 14.3. 

¹ The theory components of this chapter - these first four sections - are condensed from portions of "Adaptive 
Statistical Class-based Language Modelling", G.L. Moore; Ph.D thesis, Cambridge University 2001. 
² See section 5 of [Shannon 1948] for a more formal definition of ergodicity. 
³ Based on the analysis of 170 million words of newspaper and broadcast news text. 
⁴ A couple of hundred million words, for example. 

It is not only the storage space that must be considered, however - it is also necessary to be 
able to attach a reasonable degree of confidence to the derived estimates. Suitably large quantities 
of example training text are also therefore required to ensure statistical significance. Increasing the 
amount of training text not only gives greater confidence in model estimates, however, but also 
demands more storage space and longer analysis periods when estimating model parameters, which 
may place feasibility limits on how much data can be used in constructing the final model or how 
thoroughly it can be analysed. At the other end of the scale for restricted domain models there 
may be only a limited quantity of suitable in-domain text available, so local estimates may need 
smoothing with global priors. In addition, if language models are to be used for speech recognition 
then it is good to train them on precise acoustic transcriptions where possible - that is, text which 
features the hesitations, repetitions, word fragments, mistakes and all the other sources of deviation 
from purely grammatical language that characterise everyday speech. However, such acoustically 
accurate transcriptions are in limited supply since they must be specifically prepared; real-world 
transcripts as available for various other purposes almost ubiquitously correct any disfluencies or 
mistakes made by speakers. 

14.1.2 Equivalence classes 

The word n-gram model described in equation 14.2 uses an equivalence mapping on the word history 
which assumes that all contexts which have the same most recent n-1 words all have the same 
probability. This concept can be expressed more generally by defining an equivalence class function 
that acts on word histories, $\mathcal{E}(\cdot)$, such that if $\mathcal{E}(x) = \mathcal{E}(y)$ then $\forall w : P(w \mid x) = P(w \mid y)$: 

$$ P(w_i \mid w_1, w_2, \ldots, w_{i-1}) \simeq P(w_i \mid \mathcal{E}(w_1, w_2, \ldots, w_{i-1})) \tag{14.4} $$

A definition of $\mathcal{E}$ that describes a word n-gram is thus: 

$$ \mathcal{E}_{\text{word-}n\text{-gram}}(w_1, \ldots, w_i) = \mathcal{E}(w_{i-n+1}, \ldots, w_i) \tag{14.5} $$

In a good language model the choice of $\mathcal{E}$ should be such that it provides a reliable predictor 
of the next word, resulting in classes which occur frequently enough in the training text that they 
can be well modelled, and does not result in so many distinct history equivalence classes that it is 
infeasible to store or analyse all the resultant separate probabilities. 

14.1.3 Class n-gram models 

One method of reducing the number of word history equivalence classes to be modelled in the 
n-gram case is to consider some words as equivalent. This can be implemented by mapping a set 
of words to a word class $g \in \mathbb{G}$ by using a classification function $G(w) = g$. If any class contains 
more than one word then this mapping will result in fewer distinct word classes than there are words, 
$|\mathbb{G}| < |\mathbb{W}|$, thus reducing the number of separate contexts that must be considered. The equivalence 
classes can then be defined in terms of these word classes: 

$$ \mathcal{E}_{\text{class-}n\text{-gram}}(w_1, \ldots, w_i) = \mathcal{E}(G(w_{i-n+1}), \ldots, G(w_i)) \tag{14.6} $$

A deterministic word-to-class mapping like this has some advantages over a word n-gram model 
since the reduction in the number of distinct histories reduces the storage space and training data 
requirements whilst improving the robustness of the probability estimates for a given quantity of 
training data. Because multiple words can be mapped to the same class, the model has the ability 
to make more confident assumptions about infrequent words in a class based on other more frequent 
words in the same class⁵ than is possible in the word n-gram case - and furthermore for the same 
reason it is able to make generalising assumptions about words used in contexts which are not 
explicitly encountered in the training text. These gains, however, clearly correspond with a loss in 
the ability to distinguish between different histories, although this might be offset by the ability to 
choose a higher value of n. 

⁵ Since it is assumed that words are placed in the same class because they share certain properties. 

The most commonly used form of class n-gram model uses a single classification function, $G(\cdot)$, 
as in equation 14.6, which is applied to each word in the n-gram, including the word which is being 
predicted. Considering for clarity the bigram⁶ case, then given $G(\cdot)$ the language model has the 
terms $w_i$, $w_{i-1}$, $G(w_i)$ and $G(w_{i-1})$ available to it. The probability estimate can be decomposed 
as follows: 

$$ P_{\text{class}'}(w_i \mid w_{i-1}) = P(w_i \mid G(w_i), G(w_{i-1}), w_{i-1}) \times P(G(w_i) \mid G(w_{i-1}), w_{i-1}) \tag{14.7} $$

It is assumed that $P(w_i \mid G(w_i), G(w_{i-1}), w_{i-1})$ is independent of $G(w_{i-1})$ and $w_{i-1}$ and that 
$P(G(w_i) \mid G(w_{i-1}), w_{i-1})$ is independent of $w_{i-1}$, resulting in the model: 

$$ P_{\text{class}}(w_i \mid w_{i-1}) = P(w_i \mid G(w_i)) \times P(G(w_i) \mid G(w_{i-1})) \tag{14.8} $$
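The two-factor model of equation 14.8 can be sketched in Python as below. Both factors are estimated here by simple relative frequency, which is an assumption made for the example, and the function name, the toy text and the hand-written class map are all invented for illustration.

```python
from collections import Counter

def class_bigram_probability(words, class_of, w, w_prev):
    """P(w | G(w)) * P(G(w) | G(w_prev)), both from relative frequencies."""
    classes = [class_of[x] for x in words]
    word_count = Counter(words)
    class_count = Counter(classes)
    class_pair_count = Counter(zip(classes[:-1], classes[1:]))  # (previous, current)
    g, g_prev = class_of[w], class_of[w_prev]
    p_word_given_class = word_count[w] / class_count[g]
    p_class_given_prev = class_pair_count[(g_prev, g)] / class_count[g_prev]
    return p_word_given_class * p_class_given_prev

text = "a dog ran a cat sat".split()
class_map = {"a": "DET", "dog": "N", "cat": "N", "ran": "V", "sat": "V"}
print(class_bigram_probability(text, class_map, "cat", "a"))
print(class_bigram_probability(text, class_map, "sat", "dog"))
```

The second call is the interesting one: the word pair "dog sat" never occurs in the toy text, yet it receives a non-zero estimate because the class pair (N, V) has been observed.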

Almost all reported class n-gram work using statistically-found classes is based on clustering 
algorithms which optimise $G(\cdot)$ on the basis of bigram training set likelihood, even if the class map 
is to be used with longer-context models. It is interesting to note that this approximation appears 
to work well, however, suggesting that the class maps found are in some respects "general" and 
capture some features of natural language which apply irrespective of the context length used when 
finding these features. 

14.2 Statistically-derived Class Maps 

An obvious question that arises is how to compute or otherwise obtain a class map for use in a 
language model. This section discusses one strategy which has successfully been used. 

Methods of statistical class map construction seek to maximise the likelihood of the training 
text given the class model by making iterative controlled changes to an initial class map - in order 
to make this problem more computationally feasible they typically use a deterministic map. 

14.2.1 Word exchange algorithm 

[Kneser and Ney 1993]⁷ describes an algorithm to build a class map by starting from some initial 
guess at a solution and then iteratively searching for changes to improve the existing class map. 
This is repeated until some minimum change threshold has been reached or a chosen number of 
iterations have been performed. The initial guess at a class map is typically chosen by a simple 
method such as randomly distributing words amongst classes or placing all words in the first class 
except for the most frequent words which are put into singleton classes. Potential moves are then 
evaluated and those which increase the likelihood of the training text most are applied to the class 
map. The algorithm is described in detail below, and is implemented in the HTK tool Cluster. 

Let $\mathbf{W}$ be the training text list of words $(w_1, w_2, w_3, \ldots)$ and let $\mathbb{W}$ be the set of all words in $\mathbf{W}$. 
From equation 14.1 it follows that: 

$$ P_{\text{class}}(\mathbf{W}) = \prod_{x,y \in \mathbb{W}} P_{\text{class}}(x \mid y)^{C(x,y)} \tag{14.9} $$

where $(x, y)$ is some word pair 'x' preceded by 'y' and $C(x, y)$ is the number of times that the word 
pair 'y x' occurs in the list $\mathbf{W}$. 

In general evaluating equation 14.9 will lead to problematically small values, so logarithms can 
be used: 

$$ \log P_{\text{class}}(\mathbf{W}) = \sum_{x,y \in \mathbb{W}} C(x,y) \cdot \log P_{\text{class}}(x \mid y) \tag{14.10} $$

Given the definition of a class n-gram model in equation 14.8, the maximum likelihood bigram 
probability estimate of a word is: 

$$ P_{\text{class}}(w_i \mid w_{i-1}) = \frac{C(w_i)}{C(G(w_i))} \times \frac{C(G(w_i), G(w_{i-1}))}{C(G(w_{i-1}))} \tag{14.11} $$



⁶ By convention unigram refers to a 1-gram, bigram indicates a 2-gram and trigram is a 3-gram. There is no 
standard term for a 4-gram. 

⁷ R. Kneser and H. Ney, "Improved Clustering Techniques for Class-Based Statistical Language Modelling"; 
Proceedings of the European Conference on Speech Communication and Technology 1993, pp. 973-976 

where $C(w)$ is the number of times that the word 'w' occurs in the list $\mathbf{W}$ and $C(G(w))$ is the 
number of times that the class $G(w)$ occurs in the list resulting from applying $G(\cdot)$ to each entry of 
$\mathbf{W}$;⁸ similarly $C(G(w_x), G(w_y))$ is the count of the class pair '$G(w_y)$ $G(w_x)$' in that resultant list. 
Substituting equation 14.11 into equation 14.10 and then rearranging gives: 

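$$
\begin{aligned}
\log P_{\text{class}}(\mathbf{W})
 &= \sum_{x,y \in \mathbb{W}} C(x,y) \cdot \log \left( \frac{C(x)}{C(G(x))} \times \frac{C(G(x), G(y))}{C(G(y))} \right) \\
 &= \sum_{x \in \mathbb{W}} C(x) \cdot \log C(x) - \sum_{x \in \mathbb{W}} C(x) \cdot \log C(G(x)) \\
 &\qquad + \sum_{g,h \in \mathbb{G}} C(g,h) \cdot \log C(g,h) - \sum_{g \in \mathbb{G}} C(g) \cdot \log C(g) \\
 &= \sum_{x \in \mathbb{W}} C(x) \cdot \log C(x) + \sum_{g,h \in \mathbb{G}} C(g,h) \cdot \log C(g,h) - 2 \sum_{g \in \mathbb{G}} C(g) \cdot \log C(g)
\end{aligned}
\tag{14.12}
$$

where $(g, h)$ is some class sequence 'h g'. 

Note that the first of these three terms in the final stage of equation 14.12, "$\sum_{x \in \mathbb{W}} C(x) \cdot \log C(x)$", 
is independent of the class map function $G(\cdot)$, therefore it is not necessary to consider 
it when optimising $G(\cdot)$. The value a class map must seek to maximise, $F_{MC}$, can now be defined: 

$$ F_{MC} = \sum_{g,h \in \mathbb{G}} C(g,h) \cdot \log C(g,h) - 2 \sum_{g \in \mathbb{G}} C(g) \cdot \log C(g) \tag{14.13} $$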

A fixed number of classes must be decided before running the algorithm, which can now be 
formally defined: 

1. Initialise: $\forall w \in \mathbb{W} : G(w) = 1$ 
   Set up the class map so that all words are in the first class and all other 
   classes are empty (or initialise using some other scheme) 

2. Iterate: $\forall i \in \{1 \ldots n\} \wedge \neg s$ 
   For a given number of iterations 1 ... n or until some stop criterion s is 
   fulfilled 

   (a) Iterate: $\forall w \in \mathbb{W}$ 
       For each word w in the vocabulary 

       i. Iterate: $\forall c \in \mathbb{G}$ 
          For each class c 

          A. Move word w to class c, remembering its previous class 
          B. Calculate the change in $F_{MC}$ for this move 
          C. Move word w back to its previous class 

       ii. Move word w to the class which increased $F_{MC}$ by the most, 
           or do not move it if no move increased $F_{MC}$ 
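The steps above can be sketched compactly in Python. This is a deliberately naive illustration rather than the Cluster implementation: it recomputes $F_{MC}$ from scratch for every tentative move instead of updating counts incrementally, classes are numbered from 0, and the stop criterion (no word moved during a full pass) is an assumption chosen for the example.

```python
import math
from collections import Counter

def f_mc(words, class_of):
    """F_MC of equation 14.13, computed from class unigram and bigram counts."""
    classes = [class_of[w] for w in words]
    unigrams = Counter(classes)
    bigrams = Counter(zip(classes[:-1], classes[1:]))
    return (sum(c * math.log(c) for c in bigrams.values())
            - 2.0 * sum(c * math.log(c) for c in unigrams.values()))

def exchange(words, num_classes, iterations=2):
    """Greedy word exchange starting with every word in the first class."""
    vocab = sorted(set(words))
    class_of = {w: 0 for w in vocab}
    for _ in range(iterations):
        moved = False
        for w in vocab:
            original = class_of[w]
            best_class, best_score = original, f_mc(words, class_of)
            for c in range(num_classes):
                if c == original:
                    continue
                class_of[w] = c                     # tentative move
                score = f_mc(words, class_of)
                if score > best_score + 1e-12:      # keep only strict improvements
                    best_class, best_score = c, score
            class_of[w] = best_class                # apply best move, or move back
            moved = moved or best_class != original
        if not moved:                               # assumed stop criterion
            break
    return class_of

text = "a dog ran a cat sat a dog sat a cat ran".split()
print(exchange(text, num_classes=3))
```

Because a move is only accepted when it strictly increases $F_{MC}$, the value can never decrease from one pass to the next, which is what makes the procedure greedy.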



The initialisation scheme given here in step 1 represents a word unigram language model, making 
no assumptions about which words should belong in which class.⁹ The algorithm is greedy and so 
can get stuck in a local maximum and is therefore not guaranteed to find the optimal class map for 
the training text. The algorithm is rarely run until total convergence, however, and it is found in 
practice that an extra iteration can compensate for even a deliberately poor choice of initialisation. 

The above algorithm requires the number of classes to be fixed before running. It should be 
noted that as the number of classes utilised increases so the overall likelihood of the training text 
will tend towards that of the word model.¹⁰ This is why the algorithm does not itself modify 
the number of classes, otherwise it would naively converge on $|\mathbb{W}|$ classes. 

⁸ That is, $C(G(w)) = \sum_{x : G(x) = G(w)} C(x)$. 

⁹ Given this initialisation, the first $(|\mathbb{G}| - 1)$ moves will be to place each word into an empty class, however, since 
the class map which maximises $F_{MC}$ is the one which places each word into a singleton class. 

14.3 Robust model estimation 



Given a suitably large amount of training data, an extremely long n-gram could be trained to give a 
very good model of language, as per equation 14.1 - in practice, however, any actual extant model 
must be an approximation. Because it is an approximation, it will be detrimental to include within 
the model information which in fact was just noise introduced by the limits of the bounded sample 
set used to train the model - this information may not accurately represent text not contained 
within the training corpus. In the same way, word sequences which were not observed in the 
training text cannot be assumed to represent impossible sequences, so some probability mass must 
be reserved for these. The issue of how to redistribute the probability mass, as assigned by the 
maximum likelihood estimates derived from the raw statistics of a specific corpus, into a sensible 
estimate of the real world is addressed by various standard methods, all of which aim to create 
more robust language models. 

14.3.1 Estimating probabilities 

Language models seek to estimate the probability of each possible word sequence event occurring. 
In order to calculate maximum likelihood estimates this set of events must be finite so that the 
language model can ensure that the sum of the probabilities of all events is 1 given some context. 
In an n-gram model the combination of the finite vocabulary and fixed length history limits the 
number of unique events to $|\mathbb{W}|^n$. For any n > 1 it is highly unlikely that all word sequence events 
will be encountered in the training corpora, and many that do occur may only appear one or two 
times. A language model should not give any unseen event zero probability,¹¹ but without an 
infinite quantity of training text it is almost certain that there will be events it does not encounter 
during training, so various mechanisms have been developed to redistribute probability within the 
model such that these unseen events are given some non-zero probability. 

As in equation 14.3, the maximum likelihood estimate of the probability of an event $\mathcal{A}$ occurring 
is defined by the number of times that event is observed, a, and the total number of samples in the 
training set of all observations, A, where $P(\mathcal{A}) = \frac{a}{A}$. With this definition, events that do not occur 
in the training data are assigned zero probability since it will be the case that a = 0. [Katz 1987]¹² 
suggests multiplying each observed count by a discount coefficient factor, $d_a$, which is dependent 
upon the number of times the event is observed, a, such that $a' = d_a \cdot a$. Using this discounted 
occurrence count, the probability of an event that occurs a times now becomes $P_{\text{discount}}(\mathcal{A}) = \frac{d_a \cdot a}{A}$. 
Different discounting schemes have been proposed that define the discount coefficient, $d_a$, in specific 
ways. The same discount coefficient is used for all events that occur the same number of times on 
the basis of the symmetry requirement that two events that occur with equal frequency, a, must 
have the same probability, $p_a$. 

Defining $c_a$ as the number of events that occur exactly a times such that $A = \sum_{a \geq 1} a \cdot c_a$ it 
follows that the total amount of reserved mass, left over for distribution amongst the unseen events, 
is $\frac{1}{c_0} \left( 1 - \frac{1}{A} \sum_{a \geq 1} d_a \cdot c_a \cdot a \right)$. 



¹⁰ Which will be higher, given maximum likelihood estimates. 

¹¹ If it did then from equation 14.1 it follows that the probability of any piece of text containing that event would 
also be zero, and would have infinite perplexity. 

¹² S.M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of 
a Speech Recogniser"; IEEE Transactions on Acoustic, Speech and Signal Processing 1987, vol. 35 no. 3 pp. 
400-401 

Good-Turing discounting 

In [Good 1953]¹³ a method of discounting maximum likelihood estimates was proposed whereby 
the count of an event occurring a times is discounted with 

$$ d_a = (a+1) \frac{c_{a+1}}{a \cdot c_a} \tag{14.14} $$

A problem with this scheme, referred to as Good-Turing discounting, is that due to the count in 
the denominator it will fail if there is a case where $c_a = 0$ if there is any count $c_b > 0$ for b > a. 
Inevitably as a increases the count $c_a$ will tend towards zero and for high a there are likely to be 
many such zero counts. A solution to this problem was proposed in [Katz 1987], which defines a 
cut-off value k at which counts a for a > k are not discounted¹⁴ - this is justified by considering 
these more frequently observed counts as reliable and therefore not needing to be discounted. Katz 
then describes a revised discount equation which preserves the same amount of mass for the unseen 
events: 




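$$d_a = \begin{cases}
\dfrac{(a+1)\frac{c_{a+1}}{a \cdot c_a} - (k+1)\frac{c_{k+1}}{c_1}}{1 - (k+1)\frac{c_{k+1}}{c_1}} & : 1 \le a \le k \\
1 & : a > k
\end{cases} \qquad (14.15)$$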

This method is itself unstable, however - for example if $k \cdot c_k > c_1$ then $d_a$ will be negative for
$1 \le a \le k$.
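The two discount coefficients discussed above can be sketched in a few lines of Python. This is an illustration only, not HTK code: the function names and the count-of-counts table are hypothetical, and the Katz coefficient is written in the form commonly quoted from [Katz 1987], which is assumed here to be the revised equation the text refers to.

```python
def good_turing_discount(a, count_of_counts):
    """Good-Turing coefficient d_a = (a + 1) * c_{a+1} / (a * c_a)."""
    c_a = count_of_counts.get(a, 0)
    if c_a == 0:
        # This is the failure case described in the text: c_a = 0.
        raise ZeroDivisionError(f"c_{a} is zero, so d_{a} is undefined")
    return (a + 1) * count_of_counts.get(a + 1, 0) / (a * c_a)


def katz_discount(a, count_of_counts, k):
    """Good-Turing discount with a cut-off: counts above k are not discounted."""
    if a > k:
        return 1.0
    # Assumed form of Katz's revised coefficient (see the lead-in).
    ratio = (k + 1) * count_of_counts.get(k + 1, 0) / count_of_counts[1]
    return (good_turing_discount(a, count_of_counts) - ratio) / (1.0 - ratio)


# Hypothetical count-of-counts table: c_1 = 10, c_2 = 4, c_3 = 2, c_4 = 1.
c = {1: 10, 2: 4, 3: 2, 4: 1}
k = 2
discounts = {a: katz_discount(a, c, k) for a in c}

# Total count removed from the seen events by the discounting.
removed = sum((1.0 - discounts[a]) * a * c[a] for a in c)
print(discounts, removed)
```

For this table the removed count equals $c_1$, the same amount that unrestricted Good-Turing discounting would reserve, which is the sense in which the revised coefficient preserves the mass for unseen events.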


Absolute discounting 

An alternative discounting method is absolute discounting [Ney, Essen and Kneser 1994], in which a
constant value m is subtracted from each count. The effect of this is that the events with the lowest
counts are discounted relatively more than those with higher counts. The discount coefficient is defined as

$$d_a = \frac{a - m}{a} \qquad (14.16)$$

In order to discount the same amount of probability mass as the Good-Turing estimate, m must
be set to:

$$m = \frac{c_1}{\sum_{a=1}^{A} c_a} \qquad (14.17)$$
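The value of m follows from a short argument: subtracting m from the count of every distinct seen event removes $m \sum_a c_a$ counts in total, while Good-Turing discounting removes $c_1$ counts when no $c_a$ is zero, so equating the two gives $m = c_1 / \sum_a c_a$. The Python sketch below checks this on a hypothetical count-of-counts table; the function names are illustrative and not part of HTK.

```python
def absolute_discount_constant(count_of_counts):
    """m = c_1 divided by the number of distinct seen events (sum of c_a)."""
    return count_of_counts.get(1, 0) / sum(count_of_counts.values())


def absolute_discount(a, m):
    """Discount coefficient d_a = (a - m) / a."""
    return (a - m) / a


# Hypothetical count-of-counts table: c_1 = 10, c_2 = 4, c_3 = 2, c_4 = 1.
c = {1: 10, 2: 4, 3: 2, 4: 1}
m = absolute_discount_constant(c)

# Total count removed from the seen events; this should equal c_1.
removed = sum((1.0 - absolute_discount(a, m)) * a * c[a] for a in c)
print(m, removed)
```

Note that the relative discount m / a shrinks as a grows, which is the behaviour described in the text.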

14.3.2 Smoothing probabilities

The above discounting schemes present various methods of redistributing probability mass from
observed events to unseen events. Additionally, if events are infrequently observed then they can
be smoothed with less precise but more frequently observed events.

In [Katz 1987] a back off scheme is proposed and used alongside Good-Turing discounting. In
this method probabilities are redistributed via the recursive utilisation of lower level conditional
distributions. Given the n-gram case, if the n-tuple is not observed frequently enough in the training
text then a probability based on the occurrence count of a shorter-context (n - 1)-tuple is used
instead - using the shorter context estimate is referred to as backing off. In practice probabilities
are typically considered badly-estimated if their corresponding word sequences are not explicitly
stored in the language model, either because they did not occur in the training text or they have
been discarded using some pruning mechanism.

Katz defines a function $\beta(w_{i-n+1}, \ldots, w_{i-1})$ which represents the total probability of all the
unseen events in a particular context. The probability mass $\beta$ is then distributed amongst all the
unseen $w_i$ and the language model probability estimate becomes:

$$P(w_i \mid w_{i-n+1}, \ldots, w_{i-1}) = \begin{cases}
\alpha(w_{i-n+1}, \ldots, w_{i-1}) \cdot P(w_i \mid w_{i-n+2}, \ldots, w_{i-1}) & : c(w_{i-n+1}, \ldots, w_i) = 0 \\
d_{c(w_{i-n+1}, \ldots, w_i)} \cdot \dfrac{c(w_{i-n+1}, \ldots, w_i)}{c(w_{i-n+1}, \ldots, w_{i-1})} & : 1 \le c(w_{i-n+1}, \ldots, w_i) \le k \\
\dfrac{c(w_{i-n+1}, \ldots, w_i)}{c(w_{i-n+1}, \ldots, w_{i-1})} & : c(w_{i-n+1}, \ldots, w_i) > k
\end{cases} \qquad (14.18)$$



[Good 1953] I.J. Good, "The Population Frequencies of Species and the Estimation of Population Parameters";
Biometrika 1953, vol. 40 (3,4) pp. 237-264

On the choice of the cut-off value k, it is suggested that "k = 5 or so is a good choice".

[Ney, Essen and Kneser 1994] H. Ney, U. Essen and R. Kneser, "On Structuring Probabilistic Dependences
in Stochastic Language Modelling"; Computer Speech and Language 1994, vol. 8 no. 1 pp. 1-38
where c(.) is the count of an event and:

$$\alpha(w_{i-n+1}, \ldots, w_{i-1}) = \frac{\beta(w_{i-n+1}, \ldots, w_{i-1})}{\sum_{w_i : c(w_{i-n+1}, \ldots, w_i) = 0} P(w_i \mid w_{i-n+2}, \ldots, w_{i-1})} \qquad (14.19)$$

A back off scheme such as this can be implemented efficiently because all the back off weights $\alpha$
can be computed once and then stored as part of the language model, and through its recursive
nature it is straightforward to incorporate within a language model. Through the use of pruning
methods, contexts which occur 'too infrequently' are not stored in the model so in practice the test
$c(w_{i-n+1}, \ldots, w_i) = 0$ is implemented as referring to whether or not the context is in the model.
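The recursive lookup described above can be sketched as follows. This is a hypothetical Python illustration rather than HTK's implementation: the dictionaries `probs` and `backoff_weights`, the function name and the toy probabilities are all assumptions. The stored probabilities are taken to be already discounted, the test for an unseen event is simply whether the n-gram is stored, and a context without a stored weight is assumed to back off with weight 1.

```python
def backoff_prob(word, context, probs, backoff_weights):
    """Return P(word | context) using recursive back-off.

    probs: maps n-gram tuples (context + (word,)) to stored, discounted probabilities.
    backoff_weights: maps context tuples to their back off weight alpha.
    """
    ngram = context + (word,)
    if ngram in probs:
        return probs[ngram]            # the event is stored in the model
    if not context:
        raise KeyError(word)           # no unigram estimate: out of vocabulary
    alpha = backoff_weights.get(context, 1.0)
    # Back off to the shorter context by dropping the oldest word.
    return alpha * backoff_prob(word, context[1:], probs, backoff_weights)


# Hypothetical model: three unigrams and one stored bigram "A B".
probs = {("A",): 0.5, ("B",): 0.3, ("C",): 0.2, ("A", "B"): 0.6}

# beta is the mass not used by seen bigrams in context "A"; alpha shares it
# out over the unigram probabilities of the words unseen after "A".
beta = 1.0 - probs[("A", "B")]
alpha = beta / (probs[("A",)] + probs[("C",)])
backoff_weights = {("A",): alpha}

total = sum(backoff_prob(w, ("A",), probs, backoff_weights) for w in "ABC")
print(total)
```

Because the weight is computed once per context and stored, each lookup costs at most one dictionary access per back off level, and the conditional distribution still sums to one.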

Cut-offs

With a back off scheme low count events can be discarded - cut-off - from the model and more
frequently observed shorter-context estimates can be used instead. An additional advantage of
discarding low occurrence events is that the model size can be substantially reduced, since in
general as a decreases so the number of events increases - in fact the Good-Turing discounting
scheme depends upon this relationship.

14.4 Perplexity 

A measure of language model performance based on average probability can be developed within
the field of information theory [Shannon 1948]. A speaker emitting language can be considered
to be a discrete information source which is generating a sequence of words $w_1, w_2, \ldots, w_m$ from
a vocabulary set, W. The probability of a symbol $w_i$ is dependent upon the previous symbols
$w_1, \ldots, w_{i-1}$. The information source's inherent per-word entropy H represents the amount of
non-redundant information provided by each new word on average, defined in bits as:

$$H = -\lim_{m \to \infty} \frac{1}{m} \sum_{w_1, w_2, \ldots, w_m} \left( P(w_1, w_2, \ldots, w_m) \log_2 P(w_1, w_2, \ldots, w_m) \right) \qquad (14.20)$$

This summation is over all possible sequences of words, but if the source is ergodic then the
summation over all possible word sequences can be discarded and the equation becomes equivalent
to:

$$H = -\lim_{m \to \infty} \frac{1}{m} \log_2 P(w_1, w_2, \ldots, w_m) \qquad (14.21)$$

It is reasonable to assume ergodicity on the basis that we use language successfully without having
been privy to all words that have ever been spoken or written, and we can disambiguate words on
the basis of only the recent parts of a conversation or piece of text.

Having assumed this ergodic property, it follows that given a large enough value of m, H can
be approximated with:

$$\hat{H} = -\frac{1}{m} \log_2 P(w_1, w_2, \ldots, w_m) \qquad (14.22)$$

This last estimate is feasible to evaluate, thus providing the basis for a metric suitable for assessing
the performance of a language model.

Considering a language model as an information source, it follows that a language model which
took advantage of all possible features of language to predict words would also achieve a per-word
entropy of H. It therefore makes sense to use a measure related to entropy to assess the actual
performance of a language model. Perplexity, PP, is one such measure that is in standard use,
defined such that:

$$PP = 2^{\hat{H}} \qquad (14.23)$$

Substituting equation 14.22 into equation 14.23 and rearranging obtains:

$$PP = \hat{P}(w_1, w_2, \ldots, w_m)^{-\frac{1}{m}} \qquad (14.24)$$

where $\hat{P}(w_1, w_2, \ldots, w_m)$ is the probability estimate assigned to the word sequence $(w_1, w_2, \ldots, w_m)$
by a language model.
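As a worked illustration of equations 14.22 to 14.24, the Python sketch below computes perplexity from the conditional probabilities a model assigns to each word of a test text. It relies on the chain rule, so that the log probability of the whole sequence is the sum of the per-word log probabilities. The function name and the example probabilities are assumptions made for illustration; this is not an HTK tool.

```python
import math


def perplexity(word_probs):
    """Perplexity of a text given the model probability of each of its m words.

    word_probs: the conditional probabilities P(w_i | history), one per word.
    """
    m = len(word_probs)
    # log2 of the sequence probability, accumulated in the log domain to
    # avoid numerical underflow for long texts.
    log2_sequence_prob = sum(math.log2(p) for p in word_probs)
    entropy_estimate = -log2_sequence_prob / m     # the per-word entropy estimate
    return 2.0 ** entropy_estimate


# A model that always assigns probability 1/4 to the next word behaves as
# if it were choosing between four equally likely words.
print(perplexity([0.25] * 8))
```

Any word given zero probability makes the logarithm undefined, which is why unseen events must be assigned some non-zero probability mass before a model can be evaluated in this way.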



[Shannon 1948] C.E. Shannon, "A Mathematical Theory of Communication"; The Bell System Technical Journal 1948,
vol. 27 pp. 379-423, 623-656. Available online at http://galaxy.ucsd.edu/new/external/shannon.pdf
Perplexity can be considered to be a measure of on average how many different equally most
probable words can follow any given word. Lower perplexities represent better language models, 
although this simply means that they 'model language better', rather than necessarily work better 
in speech recognition systems - perplexity is only loosely correlated with performance in a speech 
recognition system since it has no ability to note the relevance of acoustically similar or dissimilar 
words. 

In order to calculate perplexity both a language model and some test text are required, so a 
meaningful comparison between two language models on the basis of perplexity requires the same 
test text and word vocabulary set to have been used in both cases. The size of the vocabulary can 
easily be seen to be relevant because as its cardinality is reduced so the number of possible words
given any history must monotonically decrease, therefore the probability estimates must on average
increase and so the perplexity will decrease.


14.5 Overview of n-Gram Construction Process

This section describes the overall process of building an n-gram language model using the HTK
tools. As noted in the introduction, it is a three stage process. Firstly, the training text is scanned
and the n-gram counts are stored in a set of gram files. Secondly, and optionally, the counts in the
gram files are modified to perform vocabulary and class mapping. Finally the resulting gram files
are used to build the LM. This separation into stages adds some complexity to the overall process
but it makes it much more efficient to handle very large quantities of data since the gram files only
need to be constructed once but they can be augmented, processed and used for constructing LMs
many times.

The overall process involved in building an n-gram language model using the HTK tools is
illustrated in Figure 14.1. The procedure begins with some training text, which first of all should
be conditioned into a suitable format by performing operations such as converting numbers to a
citation form, expanding common abbreviations and so on. The precise format of the training text
depends on your requirements, however, and can vary enormously - therefore conditioning tools are
not supplied with HTK.

Given some input text, the tool LGPrep scans the input word sequence and counts n-grams.
These n-gram counts are stored in a buffer which fills as each new n-gram is encountered. When
this buffer becomes full, the n-grams within it are sorted and stored in a gram file. All words (and
symbols generally) are represented within HTK by a unique integer id. The mapping from words to
ids is recorded in a word map. On startup, LGPrep loads in an existing word map, then each new
word encountered in the input text is allocated a new id and added to the map. On completion,
LGPrep outputs the new updated word map. If more text is input, this process is repeated and
hence the word map will expand as more and more data is processed.

Although each of the gram files output by LGPrep is sorted, the range of n-grams within
individual files will overlap. To build a language model, all n-gram counts must be input in sort order
so that words with equivalent histories can be grouped. To accommodate this, all HTK language
modelling tools can read multiple gram files and sort them on-the-fly. This can be inefficient,
however, and it is therefore useful to first copy a newly generated set of gram files using the HLM
tool LGCopy. This yields a set of gram files which are sequenced, i.e. the ranges of n-grams
within each gram file do not overlap and can therefore be read in a single stream. Furthermore, the
sequenced files will take less disc space since the counts for identical n-grams in different files will
have been merged.

On text conditioning: in fact a very simple text conditioning Perl script is included in
LMTutorial/extras/LCond.pl for demonstration purposes only - it converts text to uppercase (so that
words are considered equivalent irrespective of case) and reads the input punctuation in order to tag
sentences, stripping most other punctuation. See the script for more details.

On LGPrep: LGPrep can also perform text modification using supplied rules.
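The idea of sequencing gram files, i.e. reading several individually sorted files as one sorted stream and merging the counts of identical n-grams, can be sketched in Python. This is only an illustration of the principle and makes several assumptions: each "file" is an in-memory list of (n-gram, count) pairs, n-grams are sorted as tuples of strings rather than by word id, and the function name is hypothetical. It says nothing about the actual gram file format or about how LGCopy is implemented.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_gram_files(*sorted_files):
    """Merge sorted (ngram, count) streams, summing counts of equal n-grams."""
    # heapq.merge reads the inputs lazily, so only one entry per input
    # needs to be held in memory at any time.
    merged = heapq.merge(*sorted_files, key=itemgetter(0))
    for ngram, group in groupby(merged, key=itemgetter(0)):
        yield ngram, sum(count for _, count in group)


# Two hypothetical gram "files": each is sorted, but their ranges overlap.
file0 = [(("A", "B"), 2), (("B", "C"), 1)]
file1 = [(("A", "B"), 3), (("A", "D"), 1)]

sequenced = list(merge_gram_files(file0, file1))
print(sequenced)
```

The merged stream is in sort order and contains each n-gram once, so it could be cut into consecutive, non-overlapping chunks, which is the property the text calls sequenced.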
Fig. 14.1 The main stages in building an n-gram language model
(diagram boxes: Text, LGPrep, Gram files, LGCopy, LNewMap, Word map, LBuild, n-gram LM)

The set of (possibly sequenced) gram files and their associated word map provide the raw data
for building an n-gram LM. The next stage in the construction process is to define the vocabulary
of the LM and convert all n-grams which contain OOV (out of vocabulary) words so that each OOV
word is replaced by a single symbol representing the unknown class. For example, the n-gram AN
OLEAGINOUS AFFAIR would be converted to AN !!UNK AFFAIR if the word "oleaginous" was not in
the selected vocabulary and !!UNK is the name chosen for the unknown class.
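The mapping of OOV words can be sketched in Python as below. The function names, the second example n-gram and the toy vocabulary are hypothetical; the sketch only shows the principle and does not reproduce how the HTK tools store or process n-grams.

```python
from collections import Counter


def map_oov(ngram, vocabulary, unknown="!!UNK"):
    """Replace every word of the n-gram that is not in the vocabulary."""
    return tuple(word if word in vocabulary else unknown for word in ngram)


def map_counts(ngram_counts, vocabulary, unknown="!!UNK"):
    """Apply the OOV mapping to a table of counts, merging n-grams that
    become identical once their OOV words have been replaced."""
    mapped = Counter()
    for ngram, count in ngram_counts.items():
        mapped[map_oov(ngram, vocabulary, unknown)] += count
    return mapped


vocabulary = {"AN", "AFFAIR"}
print(map_oov(("AN", "OLEAGINOUS", "AFFAIR"), vocabulary))

# Two different OOV words in the same position collapse onto one n-gram.
counts = {("AN", "OLEAGINOUS", "AFFAIR"): 1, ("AN", "UNCTUOUS", "AFFAIR"): 2}
print(map_counts(counts, vocabulary))
```

Merging the counts matters because several distinct n-grams can map onto the same n-gram once their OOV words share a single symbol.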

This assignment of OOV words to a class of unknown words is a specific example of a more
general mechanism. In HTK, any word can be associated with a named class by listing it in a class
map file. Classes can be defined either by listing the class members or by listing all non-members.
For defining the unknown class the latter is used, so a plain text list of all in-vocabulary words is
supplied and all other words are mapped to the OOV class. The tool LGCopy can use a class map
to make a copy of a set of gram files in which all words listed in the class map are replaced by the
class name, and also output a word map which contains only the required vocabulary words and
their ids plus any classes and their ids.

As shown in Figure 14.1, the LM itself is built using the tool LBuild. This takes as input the
gram files and the word map and generates the required LM. The language model can be built in
steps (first a unigram, then a bigram, then a trigram, etc.) or in a single pass if required.

Fig. 14.2 The main stages in building a class-based language model
As described in section 14.1.3, a class-based language model consists of two separate components.
The first is an n-gram which models the sequence of classes (i.e. $p(c_i \mid c_{i-n+1}, \ldots, c_{i-1})$) and the
second is a class map with associated word counts or probabilities within classes allowing the
word-given-class probability bigram $p(w_k \mid c_k)$ to be evaluated. These files may then either be linked
into a single composite file or a third 'linking' file is created to point to the two separate files - both of
these operations can be performed using the LLink tool.
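How the two components combine can be sketched in Python. The sketch assumes the usual class n-gram factorisation in which each word belongs to exactly one class and the word probability is the class sequence probability multiplied by the word-given-class probability; the function name, the dictionaries and the numbers are hypothetical, and no back off is applied.

```python
def class_ngram_prob(word, history, word_to_class, class_probs, word_given_class):
    """P(word | history) = p(class | class history) * p(word | class).

    word_to_class: maps each word to its single class (an assumption).
    class_probs: maps class n-gram tuples to probabilities (first component).
    word_given_class: maps (word, class) pairs to probabilities (second component).
    """
    cls = word_to_class[word]
    class_history = tuple(word_to_class[w] for w in history)
    return class_probs[class_history + (cls,)] * word_given_class[(word, cls)]


# Hypothetical two-class example with a class bigram.
word_to_class = {"ON": "PREP", "MONDAY": "DAY", "TUESDAY": "DAY"}
class_probs = {("PREP", "DAY"): 0.4}
word_given_class = {("MONDAY", "DAY"): 0.25, ("TUESDAY", "DAY"): 0.75}

print(class_ngram_prob("MONDAY", ("ON",), word_to_class, class_probs, word_given_class))
```

Keeping the two components separate means the class sequence model can be trained on class level counts while the word-given-class probabilities come directly from the class map.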

Given a set of word classes defined in a class map file and a set of word level gram files, building
a class-based model with the HTK tools requires only a few simple modifications to the basic
procedure described above for building a word n-gram:

• Cluster is used with the word map and word level gram files derived from the source text
to construct a class map which defines which class each word is in. The same tool is then
used to generate the word-classes component file referred to above. Note that Cluster can
also be used to generate this file from an existing or manually-generated class map.

• LGCopy is used with the class map to convert the word level gram files derived from the
source text into class gram files. LBuild can then be used directly with the class level gram
files to build the class sequence n-gram language model referred to above.

• LLink is then run to create either a language model script pointing to the two separate
language model files or a single composite file. The resulting language model is then ready
for use.

The main steps of this procedure are illustrated in Figure 14.2. 

The next chapter provides a more thorough introduction to the tools as well as a tutorial to 
work through explaining how to use them in practice. 



Chapter 15 



A Tutorial Example of Building
Language Models

This chapter describes the construction and evaluation of language models using the HTK language
modelling tools. The models will be built from scratch with the exception of the text conditioning
stage necessary to transform the raw text into its most common and useful representation (e.g.
number conversions, abbreviation expansion and punctuation filtering). All resources used in this
tutorial can be found in the LMTutorial directory of the HTK distribution.

The text data used to build and test the language models are the copyright-free texts of 50
Sherlock Holmes stories by Arthur Conan Doyle. The texts have been partitioned into training
and test material (49 stories for training and 1 story for testing) and reside in the train and test
subdirectories respectively.

15.1 Database preparation 

The first stage of any language model development project is data preparation. As mentioned in the
introduction, the text data used in these examples has already been conditioned. If you examine each
file you will observe that it contains a sequence of tagged sentences. When training a language
model you need to include sentence start and end labelling because the tools cannot otherwise
infer this. Although there is only one sentence per line in these files, this is not a restriction of
the HTK tools and is purely for clarity - you can have the entire input text on a single line if
you want. Notice that the default sentence start and sentence end tokens of <s> and </s> are
used - if you were to use different tokens for these you would need to pass suitable configuration
parameters to the HTK tools. An extremely simple text conditioning tool is supplied in the form
of LCond.pl in the LMTutorial/extras folder - this only segments text into sentences on the
basis of punctuation, as well as converting to uppercase and stripping most punctuation symbols,
and is not intended for serious use. In particular it does not convert numbers into words and will
not expand abbreviations. Exactly what conditioning you perform on your source text is dependent
on the task you are building a model for.

Once your text has been conditioned, the next step is to use the tool LGPrep to scan the input
text and produce a preliminary set of sorted n-gram files. In this tutorial all n-gram files created
by LGPrep will be stored in the holmes.0 directory, so create this directory now. In
a Unix-type system, for example, the standard command is

$ mkdir holmes.0

The HTK tools maintain a cumulative word map to which every new word is added and assigned 
a unique id. This means that you can add future n-gram files without having to rebuild existing 
ones so long as you start from the same word map, thus ensuring that each id remains unique. The 
side effect of this ability is that LGPrep always expects to be given a word map, so to prepare the 
first n-gram file (also referred to elsewhere as a 'gram' file) you must pass an empty word map file. 

(The configuration parameters that set the sentence start and end tokens are STARTWORD and ENDWORD.)

You can prepare an initial, empty word map using the LNewMap tool. It needs to be passed
the name to be used internally in the word map as well as a file name to write it to; optionally
you may also change the default character escaping mode and request additional fields. Type the
following:

$ LNewMap -f WFC Holmes empty.wmap

and you'll see that an initial, empty word map file has been created for you in the file empty.wmap.
Examine the file and you will see that it contains just a header and no words. It looks like this:

Name = Holmes
SeqNo = 0
Entries = 0
EscMode = RAW
Fields = ID,WFC
\Words\

Pay particular attention to the SeqNo field since this represents the sequence number of the word
map. Each time you add words to the word map the sequence number will increase - the tools will
compare the sequence number in the word map with that in any data files they are passed, and if
the word map is too old to contain all the necessary words then it will be rejected. The Name field
must also match, although initially you can set this to whatever you like. The other fields specify
that no HTK character escaping will be used, and that we wish to store the (compulsory) word
ID field as well as an optional count field, which will reveal how many times each word has been
encountered to date. The ID field is always present which is why you did not need to pass it with
the -f option to LNewMap.

To clarify, if we were to use the Sherlock Holmes texts together with other previously generated
n-gram databases then the most recent word map available must be used instead of the prototype
map file above. This would ensure that the map saved by LGPrep once the new texts have been
processed would be suitable for decoding all available n-gram files.

We'll now process the text data with the following command: 

$ LGPrep -T 1 -a 100000 -b 200000 -d holmes.0 -n 4
         -s "Sherlock Holmes" empty.wmap train/*.txt

The -a option sets the maximum number of new words that can be encountered in the texts to
100,000 (in fact, this is the default). If, during processing, this limit is exceeded then LGPrep will
terminate with an error and the operation will have to be repeated by setting this limit to a larger
value.

The -b option sets the internal n-gram buffer size to 200,000 n-gram entries. This setting has
a direct effect on the overall process size. The memory requirement for the internal buffer can be
calculated according to membytes = (n + 1) * 4 * b where n is the n-gram size (set with the -n
option) and b is the buffer size. In the above example, the n-gram size is set to four which will
enable us to generate bigram, trigram and four-gram language models. The smaller the buffer then
in general the more separate files will be written out - each time the buffer fills a new n-gram file
is generated in the output directory, specified by the -d option.
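As a quick check of the buffer formula, the short Python sketch below evaluates membytes = (n + 1) * 4 * b for the values used in the command above. The function name is hypothetical; only the formula itself comes from the text.

```python
def lgprep_buffer_bytes(n, b):
    """Memory needed for the internal n-gram buffer: (n + 1) * 4 * b bytes.

    n: the n-gram size (the value passed with -n).
    b: the buffer size in n-gram entries (the value passed with -b).
    """
    return (n + 1) * 4 * b


# Values from the LGPrep command in this tutorial: -n 4 -b 200000.
print(lgprep_buffer_bytes(4, 200000))
```

For the tutorial settings this comes to 4,000,000 bytes, so halving the value passed with -b halves the buffer memory at the cost of writing out more gram files.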

The -T 1 option switches on tracing at the lowest level. In general you should probably aim to
run each tool with at least -T 1 since this will give you better feedback about the progress of the
tool. Other useful options to pass are -D to check the state of configuration variables - very useful
to check you have things set up correctly - and -A so that if you save the tool output you will be
able to see what options it was run with. It's good practice to always pass -T 1 -A -D to every
HTK tool in fact. You should also note that all HTK tools require the option switches to be passed
before the compulsory tool parameters - trying to run LGPrep train/*.txt -T 1 will result in an
error, for example.

Once the operation has completed, the holmes.0 directory should contain the following files:

gram.0 gram.1 gram.2 wmap

The saved word map file wmap has grown to include all newly encountered words and the identifiers 
that the tool has assigned them, and at the same time the map sequence count has been incremented 
by one. 



On matching the Name field: the exception to this is that differing text may follow a % character.
Name = Holmes
SeqNo = 1
Entries = 18080
EscMode = RAW
Fields = ID,WFC
\Words\
<s>  65536  33669
IT   65537  8106
WAS  65538  7595

Remember that the map sequence count together with the map's name field are used to verify the
compatibility between the map and any n-gram files. The contents of the n-gram files can be
inspected using the LGList tool (if not using a Unix type system you may need to omit the |
more and find some other way of viewing the output in a more manageable format; try > file.txt
and viewing the resulting file if that works):

$ LGList holmes.0/wmap holmes.0/gram.2 | more



4-Gram File holmes.0/gram.2 [165674 entries]:
Text Source: Sherlock Holmes
[4-gram listing, one n-gram and its count per line, beginning with entries for 'CAUSE, 'EM and </s> <s>]



If you examine the other n-gram files you will notice that whilst the contents of each n-gram file
are sorted, the files themselves are not sequenced - that is, one file does not carry on where the
previous one left off; each is an independent set of n-grams. To derive a sequenced set of n-gram
files, where no grams are repeated between files, the tool LGCopy must be used on these existing
gram files. For the purposes of this tutorial the new set of files will be stored in the holmes.1
directory, so create this and then run LGCopy:

$ mkdir holmes.1
$ LGCopy -T 1 -b 200000 -d holmes.1 holmes.0/wmap holmes.0/gram.*
Input file holmes.0/gram.0 added, weight=1.0000
Input file holmes.0/gram.1 added, weight=1.0000
Input file holmes.0/gram.2 added, weight=1.0000
Copying 3 input files to output files with 200000 entries
 saving 200000 ngrams to file holmes.1/data.0
 saving 200000 ngrams to file holmes.1/data.1
 saving 89516 ngrams to file holmes.1/data.2
489516 out of 489516 ngrams stored in 3 files






The resulting n-gram files, together with the word map, can now be used to generate language 
models for a specific vocabulary list. Note that it is not necessary to sequence the files in this way 
before building a language model, but if you have too many separate unsequenced n-gram files then 
you may encounter performance problems or reach the limit of your filing system to maintain open 
files - in practice, therefore, it is a good idea to always sequence them. 
15.2 Mapping OOV words



An important step in building a language model is to decide on the system's vocabulary. For the
purpose of this tutorial, we have supplied a word list in the file 5k.wlist which contains the 5000
most common words found in the text. We'll build our language models and all intermediate files
in the lm_5k directory, so create it with a suitable command:

$ mkdir lm_5k 

Once the system's vocabulary has been specified, the tool LGCopy should be used to filter
out all out-of-vocabulary (OOV) words. To achieve this, the 5K word list is used as a special case
of a class map which maps all OOVs into members of the "unknown" word class. The unknown
class symbol defaults to !!UNK, although this can be changed via the configuration parameter
UNKNOWNNAME. Run LGCopy again:


$ LGCopy -T 1 -o -m lm_5k/5k.wmap -b 200000 -d lm_5k -w 5k.wlist
         holmes.0/wmap holmes.1/data.*
Input file holmes.1/data.0 added, weight=1.0000
Input file holmes.1/data.1 added, weight=1.0000
Input file holmes.1/data.2 added, weight=1.0000
Copying 3 input files to output files with 200000 entries
Class map = 5k.wlist [Class mappings only]
 saving 75400 ngrams to file lm_5k/data.0
92918 out of 489516 ngrams stored in 1 files

Because the -o option was passed, all n-grams containing OOVs will be extracted from the
input files and the OOV words mapped to the unknown symbol with the results stored in the files
lm_5k/data.*. A new word map containing the new class symbols (!!UNK in this case) and only
words in the vocabulary will be saved to lm_5k/5k.wmap. Note how the newly produced OOV
n-gram files can no longer be decoded by the original word map holmes.0/wmap:

$ LGList holmes.0/wmap lm_5k/data.0 | more
ERROR [+15330] OpenNGramFile: Gram file map Holmes%%5k.wlist
 inconsistent with Holmes
FATAL ERROR - Terminating program LGList

The error is due to the mismatch between the original map's name ("Holmes") and the name of
the map stored in the header of the n-gram file we attempted to list ("Holmes%%5k.wlist"). The
latter name indicates that the word map was derived from the original map Holmes by resolving
class membership using the class map 5k.wlist. As a further consistency indicator, the original
map has a sequence count of 1 whilst the class-resolved map has a sequence count of 2.
The correct command for listing the contents of the OOV n-gram file is:



$ LGList lm_5k/5k.wmap lm_5k/data.0 | more

4-Gram File lm_5k/data.0 [75400 entries]:
Text Source: LGCopy
!!UNK !!UNK !!UNK !!UNK      50
!!UNK !!UNK !!UNK </s>       20
!!UNK !!UNK !!UNK A          2
!!UNK !!UNK !!UNK ACCOUNTS   1
!!UNK !!UNK !!UNK ACROSS     1
!!UNK !!UNK !!UNK AND        17
At the same time the class resolved map lm_5k/5k.wmap can be used to list the contents of the
n-gram database files - the newer map can view the older grams, but not vice-versa.



$ LGList lm_5k/5k.wmap holmes.1/data.2 | more

4-Gram File holmes.1/data.2 [89516 entries]:
Text Source: LGCopy
[4-gram listing, one n-gram and its count per line; the rows are not recoverable]
However, any n-grams containing OOV words will be discarded since these are no longer in the
word map.

Note that the required word map lm_5k/5k.wmap can also be produced using the LSubset
tool:

$ LSubset -T 1 holmes.0/wmap 5k.wlist lm_5k/5k.wmap


Note also that had the -o option not been passed to LGCopy then the n-gram files built in
lm_5k would have contained not only those with OOV entries but also all the remaining purely
in-vocabulary n-grams, the union of those shown by the two preceding LGList commands, in fact.
The method that you choose to use depends on what experiments you are performing - the HTK
tools allow you a degree of flexibility.

15.3 Language model generation

Language models are built using the LBuild command. If you're constructing a class-based model
you'll also need the Cluster tool, but for now we'll construct a standard word n-gram model.

You'll probably want to accept the default of using Turing-Good discounting for your n-gram
model, so the first step in generating a language model is to produce a frequency of frequency (FoF)
table for the chosen vocabulary list. This is performed automatically by LBuild, but optionally
you can generate this yourself using the LFoF tool and pass the result into LBuild. This has only
a negligible effect on computation time, but the result is interesting in itself because it provides
useful information for setting cut-offs. Cut-offs are where you choose to discard low frequency
events from the training text - you might wish to do this to decrease model size, or because you
judge these infrequent events to be unimportant.

In this example, you can generate a suitable table from the language model databases and the
newly generated OOV n-gram files:

$ LFoF -T 1 -n 4 -f 32 lm_5k/5k.wmap lm_5k/5k.fof
       holmes.1/data.* lm_5k/data.*
Input file holmes.1/data.0 added, weight=1.0000
Input file holmes.1/data.1 added, weight=1.0000
Input file holmes.1/data.2 added, weight=1.0000
Input file lm_5k/data.0 added, weight=1.0000
Calculating FoF table


After executing the command, the FoF table will be stored in lm_5k/5k.fof. It shows the
number of times a word is found with a given frequency - if you recall the definition of Turing-
Good discounting you will see that this needs to be known. See chapter 16 for further details of the
FoF file format.
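The idea behind a frequency of frequency table can be sketched in Python. This is an illustration of the concept only: the function name and toy counts are hypothetical, the table is built for a single n-gram order, and the result is a plain list rather than the FoF file format described in chapter 16.

```python
from collections import Counter


def fof_table(ngram_counts, max_freq):
    """Return [c_1, c_2, ..., c_max_freq], where c_a is the number of
    distinct n-grams that were observed exactly a times."""
    table = Counter(c for c in ngram_counts.values() if c <= max_freq)
    return [table.get(a, 0) for a in range(1, max_freq + 1)]


# Hypothetical bigram counts.
bigram_counts = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 2, ("D", "E"): 5}
print(fof_table(bigram_counts, 4))
```

The entries of such a table are the $c_a$ values needed by the discount coefficient, which is why the table has to be available before the model probabilities can be computed.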

You can also pass a configuration parameter to LFoF to make it output a related table showing
the number of n-grams that will be left if different cut-off rates are applied. Rerun LFoF and also
pass it the existing configuration file config:

$ LFoF -C config -T 1 -n 4 -f 32 lm_5k/5k.wmap lm_5k/5k.fof
       holmes.1/data.* lm_5k/data.*
Input file holmes.1/data.0 added, weight=1.0000
Input file holmes.1/data.1 added, weight=1.0000
Input file holmes.1/data.2 added, weight=1.0000
Input file lm_5k/data.0 added, weight=1.0000
Calculating FoF table

cutoff 1-g 2-g 3-g 4-g
The information can be interpreted as follows. A bigram cut-off value of 1 will leave 49014 bigrams in
the model, whilst a trigram cut-off of 3 will result in 17945 trigrams in the model. The configuration
file config forces the tool to print out this extra information by setting LPCALC: TRACE = 3. This is
the trace level for one of the library modules, and is separate from the trace level for the tool itself
(in this case we are passing -T 1 to set trace level 1). The trace field consists of a series of bits, so
setting trace 3 actually turns on two of those trace flags.
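A cut-off table of this kind can be sketched in Python as follows. The sketch assumes that a cut-off of c keeps only those n-grams observed more than c times; that reading, the function name and the toy counts are assumptions for illustration and are not taken from the LFoF output above.

```python
def ngrams_left(ngram_counts, cutoff):
    """Number of distinct n-grams whose count exceeds the cut-off."""
    return sum(1 for count in ngram_counts.values() if count > cutoff)


# Hypothetical bigram counts.
bigram_counts = {("A", "B"): 1, ("B", "C"): 1, ("C", "D"): 2, ("D", "E"): 5}

# One row per cut-off value, in the spirit of the table printed by LFoF.
for cutoff in range(4):
    print(cutoff, ngrams_left(bigram_counts, cutoff))
```

Under this assumption a cut-off of 0 keeps every observed n-gram, and each increase in the cut-off removes the n-grams of the next lowest frequency, which are typically the most numerous.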

We'll now proceed to build our actual language model. In this tutorial the model will be generated in
stages by executing LBuild separately for each of the unigram, bigram and trigram sections of
the model (we won't build a 4-gram model in this example, although the n-gram files we've built
allow us to do so at a later date if we so wish), but you can build the final trigram in one go if you
like. The following command will generate the unigram model:

$ LBuild -T 1 -n 1 lm_5k/5k.wmap lm_5k/ug
         holmes.1/data.* lm_5k/data.*

Look in the lm_5k directory and you'll discover the model ug which can now be used on its own as
a complete ARPA format unigram language model.

We'll now build a bigram model with a cut-off of 1 and to save regenerating the unigram
component we'll include our existing unigram model:

$ LBuild -C config -T 1 -t lm_5k/5k.fof -c 2 1 -n 2
         -l lm_5k/ug lm_5k/5k.wmap lm_5k/bg1
         holmes.1/data.* lm_5k/data.*

Passing the config file again means that we get given some discount coefficient information. Try
rerunning the tool without the -C config to see the difference. We've also passed in the existing
lm_5k/5k.fof file although this is not necessary - try omitting this and you'll find that the resulting
file is identical. What will be different, however, is that the tool will print out the cut-off table seen
when running LFoF with the LPCALC: TRACE = 3 parameter set; if you don't want to see this then
don't set LPCALC: TRACE = 3 in the configuration file (try running the above command without
-t and -C).

Note that this bigram model is created in HTK's own binary version of the ARPA format language
model, with just the unigram component in text format by default. This makes the model more
compact and faster to load. If you want to override this then simply add the -f TEXT parameter
to the command.

Finally, the trigram model can be generated using the command: 

$ LBuild -T 1 -c 3 1 -n 3 -l lm_5k/bg1
         lm_5k/5k.wmap lm_5k/tg1_1
         holmes.1/data.* lm_5k/data.*

Alternatively, instead of the three stages above, you can also build the final trigram in one step:

$ LBuild -T 1 -c 2 1 -c 3 1 -n 3 lm_5k/5k.wmap
         lm_5k/tg2-1_1 holmes.1/data.* lm_5k/data.*

If you compare the two trigram models you'll see that they're the same size - there will probably
be a few insignificant changes in probability due to more cumulative rounding errors incorporated
in the three stage procedure.






15.4 Testing the LM perplexity 

Once the language models have been generated, their "goodness" can be evaluated by computing
the perplexity of previously unseen text data. This won't necessarily tell you how well the language
model will perform in a speech recognition task because it takes no account of acoustic similarities
or the vagaries of any particular system, but it will reveal how well a given piece of test text is
modelled by your language model. The directory test contains a single story which was withheld
from the training text for testing purposes - if it had been included in the training text then it
wouldn't be fair to test the perplexity on it since the model would have already 'seen' it.
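As a reminder of what is being measured, perplexity is the inverse geometric mean of the per-word probabilities a model assigns to the test text. The Python sketch below uses that standard textbook definition with base-10 log probabilities; the function name and the toy input are assumptions for illustration, and it does not claim to reproduce the exact bookkeeping performed by LPlex.

```python
import math


def perplexity(log10_probs):
    """Perplexity from a list of per-word log10 probabilities.

    Standard definition: 10 ** (-(1/N) * sum(log10 p_i)).
    """
    if not log10_probs:
        raise ValueError("at least one word probability is required")
    mean_log = sum(log10_probs) / len(log10_probs)
    return 10.0 ** (-mean_log)


# Toy case: a model that gives every word probability 1/8 is as uncertain
# as a uniform choice between 8 words.
uniform = [math.log10(1.0 / 8.0)] * 5
print(perplexity(uniform))
```

A lower value means the model found the test text less surprising.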

Perplexity evaluation is carried out using LPlex. The tool accepts input text in one of two
forms - either as an HTK style MLF (this is the default mode) or as a simple text stream. The
text stream mode, specified with the -t option, will be used to evaluate the test material in this
example.

$ LPlex -n 2 -n 3 -t lm_5k/tg1_1 test/red-headed_league.txt
LPlex test #0: 2-gram
perplexity 131.8723, var 7.8744, utterances 556, words predicted 8588
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

Access statistics for lm_5k/tg1_1:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           8588   78.?%   20.6%    0.4%   -4.88   2.81
trigram             0    0.0%    0.0%    0.0%    0.00   0.00
LPlex test #1: 3-gram
perplexity 113.2480, var 8.9???, utterances 556, words predicted 8127
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

Access statistics for lm_5k/tg1_1:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           5357   68.2%   31.1%    0.6%   -5.66   2.93
trigram          8127   34.1%   30.2%   35.7%   -4.73   2.99

The multiple -n options instruct LPlex to perform two separate tests on the data. The first test
(-n 2) will use only the bigram part of the model (and unigram when backing off), whilst the
second test (-n 3) will use the full trigram model. For each test, the first part of the result gives
general information such as the number of utterances and tokens encountered, words predicted and
OOV statistics. The second part of the results gives explicit access statistics for the back off model.
For the trigram model test, the total number of words predicted is 8127. From this number, 34.1%
were found as explicit trigrams in the model, 30.2% were computed by backing off to the respective
bigrams and 35.7% were simply computed as bigrams by shortening the word context.
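The percentages in the access statistics can be turned back into approximate word counts, which is a quick way to check that a row is self-consistent. The following Python sketch does this for the trigram row quoted above; the function name is hypothetical, and because the tool prints percentages to one decimal place the recovered counts are only approximate.

```python
def access_counts(requested, exact_pct, backed_pct, na_pct):
    """Convert one row of access statistics into approximate word counts."""
    percentages = {"exact": exact_pct, "backed": backed_pct, "n/a": na_pct}
    return {name: round(requested * pct / 100.0) for name, pct in percentages.items()}


# Trigram row of the -n 3 test: 8127 requests, 34.1% / 30.2% / 35.7%.
counts = access_counts(8127, 34.1, 30.2, 35.7)
print(counts)

# The three percentages should account for (almost exactly) every request.
print(round(34.1 + 30.2 + 35.7, 1))
```

If the three percentages of a row are far from summing to 100, the row has probably been misread.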

These perplexity tests do not include the prediction of words from context which includes OOVs.
To include such n-grams in the calculation the -u option should be used.

$ LPlex -u -n 3 -t lm_5k/tg1_1 test/red-headed_league.txt
LPlex test #0: 3-gram
perplexity 117.4177, var 8.9075, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

Access statistics for lm_5k/tg1_1:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           5911   68.5%   30.9%    0.6%   -5.75   2.94
trigram          9187   35.7%   31.2%   33.2%   -4.77   2.98

The number of tokens predicted has now risen to 9187. For analysing OOV rates the tool provides
the -o option which will print a list of unique OOVs encountered together with their occurrence
counts. Further trace output is available with the -T option.

15.5 Generating and using count-based models 

The language models generated in the previous section are static in terms of their size and
vocabulary. For example, in order to evaluate a trigram model with cut-offs 2 (bigram) and 2 (trigram)
the user would be required to rebuild the bigram and trigram stages of the model. When large
amounts of text data are used this can be a very time consuming operation.

The HLM toolkit provides the capabilities to generate and manipulate a more generic type of
model, called a count-based model, which can be dynamically adjusted in terms of its size and
vocabulary. Count-based models are produced by specifying the -x option to LBuild. The user
may set cut-off parameters which control the initial size of the model, but if so then once the model
is generated only higher cut-off values may be specified in the subsequent operations. The following
command demonstrates how to generate a count-based model:

$ LBuild -C config -T 1 -t lm_5k/5k.fof -c 2 1 -c 3 1
         -x -n 3 lm_5k/5k.wmap lm_5k/tg1_1c
         holmes.1/data.* lm_5k/data.0

Note that in the above example the full trigram model is generated by a single invocation of the
tool and no intermediate files are kept (i.e. the unigram and bigram model files).

The generated model can now be used in perplexity tests and different model sizes can be
obtained by specifying cut-off values via the -c option of LPlex. Thus, using a trigram model
with cut-offs (2,2) gives

$ LPlex -c 2 2 -c 3 2 -T 1 -u -n 3 -t lm_5k/tg1_1c
        test/red-headed_league.txt
LPlex test #0: 3-gram
Processing text stream: test/red-headed_league.txt
perplexity 126.2665, var 9.05?9, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

and a model with cut-offs (3,3) gives

$ LPlex -c 2 3 -c 3 3 -T 1 -u -n 3 -t lm_5k/tg1_1c
        test/red-headed_league.txt
Processing text stream: test/red-headed_league.txt
perplexity 133.4451, var 9.0880, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

However, the count model tg1_1c cannot be used directly in recognition tools such as HVite
or HLvx. An ARPA style model of the required size suitable for recognition can be derived using
the HLMCopy tool:

$ HLMCopy -T 1 lm_5k/tg1_1c lm_5k/rtg1_1

This will be the same as the original trigram model built above, with the exception of some
insignificant rounding differences.

15.6 Model interpolation

The HTK language modelling tools also provide the capabilities to produce and evaluate
interpolated language models. Interpolated models are generated by combining a number of existing
models in a specified ratio to produce a new model using the tool LMerge. Furthermore, LPlex
can also compute perplexities using linearly interpolated n-gram probabilities from a number of
source models. The use of model interpolation will be demonstrated by combining the previously
generated Sherlock Holmes model with an existing 60,000 word business news domain trigram model
(60kbn_tg.lm). The perplexity measure of the unseen Sherlock Holmes text using the business news
model is 297 with an OOV rate of 1.5% (LPlex -t -u 60kbn_tg.lm test/*). In the following
example, the perplexity of the test data will be calculated by combining the two models in the ratio
of 0.6 60kbn_tg.lm and 0.4 tg1_1c:

$ LPlex -T 1 -u -n 3 -t -i 0.6 ./60kbn_tg.lm
        lm_5k/tg1_1c test/red-headed_league.txt
Loading language model from lm_5k/tg1_1c
Loading language model from ./60kbn_tg.lm
Using language model(s):
  3-gram lm_5k/tg1_1c, weight 0.40
  3-gram ./60kbn_tg.lm, weight 0.60
Found 60275 unique words in 2 model(s)
LPlex test #0: 3-gram
Processing text stream: test/red-headed_league.txt
perplexity 188.0937, var 11.2408, utterances 556, words predicted 9721
num tokens 10408, OOV 131, OOV rate 1.33% (excl. </s>)

Access statistics for lm_5k/tg1_1c:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           5479   68.0%   31.3%    0.6%   -5.69   2.93
trigram          8???   34.2%   30.6%   35.1%   -4.75   2.99

Access statistics for ./60kbn_tg.lm:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           5034   83.0%   17.0%    0.1%   -7.14   3.57
trigram          9683   4?.?%   26.9%   25.1%   -5.69   3.53
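Linear interpolation combines the probabilities that the source models assign to the same word in the same context, weighted so that the weights sum to one. The Python sketch below shows the arithmetic for two models with the 0.6 and 0.4 weights used above; the function name and the toy probabilities are assumptions made for illustration, not values produced by the HTK tools.

```python
import math


def interpolate(prob_a: float, prob_b: float, weight_a: float) -> float:
    """Linearly interpolate two probabilities for the same word and context.

    weight_a is applied to model A; model B receives the remainder.
    """
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("interpolation weight must lie between 0 and 1")
    return weight_a * prob_a + (1.0 - weight_a) * prob_b


# Hypothetical probabilities for one word under the two models.
p_business_news = 0.01   # model given weight 0.6
p_holmes = 0.05          # model given weight 0.4
combined = interpolate(p_business_news, p_holmes, 0.6)
print(math.isclose(combined, 0.026))
```

Because the result is a weighted average, it always lies between the two source probabilities.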

A single combined model can be generated using LMerge:

$ LMerge -T 1 -i 0.6 ./60kbn_tg.lm 5k_unk.wlist
         lm_5k/rtg1_1 5k_merged.lm

Note that LMerge cannot merge count-based models, hence the use of lm_5k/rtg1_1 instead of its
count-based equivalent lm_5k/tg1_1c. Furthermore, the word list supplied to the tool also includes
the OOV symbol (!!UNK) in order to preserve OOV n-grams in the output model which in turn
allows the use of the -u option in LPlex.

Note that the perplexity you will obtain with this combined model is much lower than that
when interpolating the two together because the word list has been reduced from the union of the
60K and 5K ones down to a single 5K list. You can build a 5K version of the 60K model using
HLMCopy and the -w option, but first you need to construct a suitable word list - if you pass it
the 5k_unk.wlist one it will complain about the words in it that weren't found in the language
model. In the extras subdirectory you'll find a Perl script to rip the word list from the 60kbn_tg.lm
model, getwordlist.pl, and the result of running it in 60k.wlist (the script will work with any
ARPA type language model). The intersection of the 60K and 5K word lists is what is required, so
if you then run the extras/intersection.pl Perl script, amended to use suitable filenames, you'll
get the result in 60k-5k-int.wlist. Then HLMCopy can be used to produce a 5K vocabulary
version of the 60K model:

$ HLMCopy -T 1 -w 60k-5k-int.wlist 60kbn_tg.lm 5kbn_tg.lm

This can then be linearly interpolated with the previous 5K model to compare the perplexity result
with that obtained from the LMerge-generated model. If you do this you will find that the
perplexities are similar, but not exactly the same (a perplexity of 11? with the merged model and
114 with the two models linearly interpolated, in fact) - this is because using LMerge to combine
two models and then using the result is not precisely the same as linearly interpolating two separate
models; it is similar, however.
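The word-list intersection step described above amounts to keeping only the words that appear in both lists. The Python sketch below illustrates the idea on in-memory lists; it is not the supplied Perl script, the function name is hypothetical, and reading or writing the actual word list files (including any header) is deliberately left out.

```python
def intersect_word_lists(first, second):
    """Return the words of `first` that also occur in `second`, keeping order."""
    second_set = set(second)
    return [word for word in first if word in second_set]


# Toy lists standing in for the 60K and 5K vocabularies.
large_list = ["A", "ABLE", "ABOUT", "MARKET", "THE"]
small_list = ["ABLE", "A", "HOLMES", "THE"]
print(intersect_word_lists(large_list, small_list))
```

Every word in the result is known to both models, so HLMCopy has no missing words to complain about.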

It is also possible to add to an existing language model using the LAdapt tool, which will
construct a new model using supplied text and then merge it with the existing one in exactly the
same way as LMerge. Effectively this tool allows you to short-cut the process by performing many
operations with a single command - see the documentation in section 17.24 for more details.

15.7 Class-based models 

A class-based n-gram model is similar to a word-based n-gram in that both store probabilities of
n-tuples of tokens - except in the class model case these tokens consist of word classes instead of words
(although word models typically include at least one class for the unknown word). Thus building
a class model involves constructing class n-grams. A second component of the model calculates
the probability of a word given each class. The HTK tools only support deterministic class maps,
so each word can only be in one class. Class language models use a separate file to store each of
the two components - the word-given-class probabilities and the class n-grams - as well as a third
file which points to the two component files. Alternatively, the two components can be combined
together into a standalone separate file. In this section we'll see how to build these files using the
supplied tools.
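The two components combine multiplicatively: the probability of a word is the probability of its class given the preceding classes, times the probability of the word given its class. The Python sketch below shows this for a trigram with a deterministic class map; all names, class labels and probabilities are invented for illustration, and back-off is ignored.

```python
import math


def class_trigram_prob(word, history, class_of, word_given_class, class_ngram):
    """P(word | history) under a class trigram with a deterministic class map.

    history is the pair of preceding words; back-off is not modelled here.
    """
    word_class = class_of[word]
    class_history = tuple(class_of[h] for h in history)
    return word_given_class[(word, word_class)] * class_ngram[class_history + (word_class,)]


# Hypothetical toy model: each word belongs to exactly one class.
class_of = {"THE": "CLASS1", "MAN": "CLASS2", "WALKED": "CLASS3"}
word_given_class = {("WALKED", "CLASS3"): 0.1}
class_ngram = {("CLASS1", "CLASS2", "CLASS3"): 0.5}

p = class_trigram_prob("WALKED", ("THE", "MAN"), class_of, word_given_class, class_ngram)
print(math.isclose(p, 0.05))
```

Since many words share a class, the class n-gram table is much smaller than the equivalent word n-gram table, at the cost of being less specific.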

Before a class model can be built it is necessary to construct a class map which defines which 
words are in each class. The supplied Cluster tool can derive a class map based on the bigram 
word statistics found in some text, although if you are constructing a large number of classes it can 
be rather slow (execution time measured in hours, typically). In many systems class models are 
combined with word models to give further gains, so we'll build a class model based on the Holmes 
training text and then interpolate it with our existing word model to see if we can get a better 
overall model. 

Constructing a class map requires a decision to be made as to how many separate classes are
required. A sensible number depends on what you are building the model for, and whether you
intend it purely to interpolate with a word model. In the latter case, for example, a sensible number
of classes is often around the 1000 mark when using a 64K word vocabulary. We only have 5000
words in our vocabulary so we'll choose to construct 150 classes in this case.

Create a directory called holmes.2 and run Cluster with

$ Cluster -T 1 -c 150 -i 1 -k -o holmes.2/class lm_5k/5k.wmap
          holmes.1/data.* lm_5k/data.0
Preparing input gram set
Input gram file holmes.1/data.0 added (weight=1.000000)
Input gram file lm_5k/data.0 added (weight=1.000000)
Beginning iteration 1
Iteration complete
Cluster completed successfully

The word map and gram files are passed as before - any OOV mapping should be made before
building the class map. Passing the -k option tells Cluster to keep the unknown word token !!UNK
in its own singleton class, whilst the -c 150 option specifies that we wish to create 150 classes.
The -i 1 performs only one iteration of the clusterer - performing further iterations is likely to
give further small improvements in performance, but we won't wait for this here. Whilst Cluster
is running you can look at the end of holmes.2/class.1.log to see how far it has got. On
a Unix-like system you could use a command like tail holmes.2/class.1.log, or if you wanted
to monitor progress then tail -f holmes.2/class.1.log would do the trick. The 1 refers to the
iteration, whilst the results are written to this filename because of the -o holmes.2/class option
which sets the prefix for all output files.

In the holmes.2 directory you will also see the files class.recovery and class.recovery.cm -
these are a recovery status file and its associated class map which are exported at regular intervals
because the Cluster tool can take so long to run. In this way you can kill the tool before it has
finished and resume execution at a later date by using the -x option; in this case you would use -x
holmes.2/class.recovery for example (making sure you pass the same word map and gram files
- the tool does not currently check that you pass it the same files when restarting).

Once the tool finishes running you should see the file holmes.2/class.1.cm which is the
resulting class map. It is in plain text format so feel free to examine it. Note, for example, how CLASS23
consists almost totally of verb forms ending in -ED, whilst CLASS41 lists various general words for
a person or object. Had you created more classes then you would be likely to see more distinctive
classes. We can now use this file to build the class n-gram component of our language model.

$ LGCopy -T 1 -d holmes.2 -m holmes.2/cmap -w holmes.2/class.1.cm
         lm_5k/5k.wmap holmes.1/data.* lm_5k/data.0
Input file holmes.1/data.0 added, weight=1.0000
Input file lm_5k/data.0 added, weight=1.0000
Copying 2 input files to output files with 2000000 entries
Class map = holmes.2/class.1.cm
saving 162397 ngrams to file holmes.2/data.0
330433 out of 330433 ngrams stored in 1 files

The -w option specifies an input class map which is applied when copying the gram files, so we
now have a class gram file in holmes.2/data.0. It has an associated word map file holmes.2/cmap
- although this only contains class names it is technically a word map since it is taken as input
wherever a word map is required by the HTK language modelling tools; recall that word maps can
contain classes as witnessed by !!UNK previously.

You can examine the class n-grams in a similar way to previously by using LGList:

$ LGList holmes.2/cmap holmes.2/data.0 | more

3-Gram File holmes.2/data.0[162397 entries]:
 Text Source: LGCopy
CLASS1 CLASS10 CLASS103
CLASS1 CLASS10 CLASS11
CLASS1 CLASS10 CLASS118
CLASS1 CLASS10 CLASS12
CLASS1 CLASS10 CLASS126
CLASS1 CLASS10 CLASS140
CLASS1 CLASS10 CLASS147



And similarly the class n-gram component of the overall language model is built using LBuild
as previously with

$ LBuild -T 1 -c 2 1 -c 3 1 -n 3 holmes.2/cmap
         lm_5k/cl150-tg_1_1.cc holmes.2/data.*
Input file holmes.2/data.0 added, weight=1.0000

To build the word-given-class component of the model we must run Cluster again.

$ Cluster -l holmes.2/class.1.cm -i 0 -q lm_5k/cl150-counts.wc
          lm_5k/5k.wmap holmes.1/data.* lm_5k/data.0

This is very similar to how we ran Cluster earlier, except that we now want to perform 0
iterations (-i 0) and we start by loading in the existing class map with -l holmes.2/class.1.cm.
We don't need to pass -k because we aren't doing any further clustering and we don't need to specify
the number of classes since this is read from the class map along with the class contents. The -q
lm_5k/cl150-counts.wc option tells the tool to write word-given-class counts to the specified file.
Alternatively we could have specified -p instead of -q and written probabilities as opposed to
counts. The file is in a plain text format, and either the -p or -q version is sufficient for forming
the word-given-class component of a class language model. Note that in fact we could have simply
added either -p or -q the first time we ran Cluster and generated both the class map and language
model component file in one go.

Given the two language model components we can now link them together to make our overall
class n-gram language model.

$ LLink lm_5k/cl150-counts.wc lm_5k/cl150-tg_1_1.cc
        lm_5k/cl150-tg_1_1

The LLink tool creates a simple text file pointing to the two necessary components,
auto-detecting whether a count or probabilities file has been supplied. The resulting file, lm_5k/cl150-tg_1_1,
is the finished overall class n-gram model, which we can now assess the performance of with LPlex.

$ LPlex -n 3 -t lm_5k/cl150-tg_1_1 test/red-headed_league.txt
LPlex test #0: 3-gram
perplexity 125.9065, var 7.4139, utterances 556, words predicted
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

Access statistics for lm_5k/cl150-tg_1_1:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           2867   95.4%    4.6%    0.0%   -4.61   1.64
trigram          8127   64.7%   24.1%   11.2%   -4.84   2.72

The class trigram model performs worse than the word trigram (which had a perplexity of
117.4), but this is not a surprise since this is true of almost every reasonably-sized test set - the
class model is less specific. Interpolating the two often leads to further improvements, however. We
can find out if this will happen in this case by interpolating the models with LPlex.

$ LPlex -u -n 3 -t -i 0.4 lm_5k/cl150-tg_1_1 lm_5k/tg1_1
        test/red-headed_league.txt
LPlex test #0: 3-gram
perplexity 102.6389, var 7.3924, utterances 556, words predicted 9187
num tokens 10408, OOV 665, OOV rate 6.75% (excl. </s>)

Access statistics for lm_5k/tg2-1_1:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           5911   68.5%   30.9%    0.6%   -5.75   2.94
trigram          9187   35.7%   31.2%   33.2%   -4.77   2.98

Access statistics for lm_5k/cl150-tg_1_1:
Lang model  requested   exact  backed     n/a    mean  stdev
bigram           3104   95.5%    4.5%    0.0%   -4.67   1.62
trigram          9187   6?.?%   23.9%    9.9%   -4.87   2.75

So a further gain is obtained - the interpolated model performs significantly better. Further
improvement might be possible by attempting to optimise the interpolation weight.

Note that we could also have used LLink to build a single class language model file instead of
producing a third file which points to the two components. We can do this by using the -s single
file option.

$ LLink -s lm_5k/cl150-counts.wc lm_5k/cl150-tg_1_1.cc
        lm_5k/cl150-tg_1_1.all

The file lm_5k/cl150-tg_1_1.all is now a standalone language model, identical in performance to
lm_5k/cl150-tg_1_1 created earlier.

15.8 Problem solving

Sometimes a tool returns an error message which doesn't seem to make sense when you check the
files you've passed and the switches you've given. This section provides a few problem-solving hints.



15.8.1 File format problems

If a file which seems to be in the correct format is giving errors such as 'Bad header' then make
sure that you are using the correct input filter. If the file is gzipped then ensure you are using a
suitable configuration parameter to decompress it on input; similarly if it isn't compressed then
check you're not trying to decompress it. Also check to see if you have two files, one with and one
without a .gz extension - maybe you're picking up the wrong one and checking the other file.

You might be missing a switch or configuration file to tell the tool which format the file is in.
In general none of the HTK language modelling tools can auto-detect file formats - unless you tell
them otherwise they will expect the file type they are configured to default to and will give an error
relevant to that type if it does not match. For example, if you omit to pass -t to LPlex then it
will treat an input text file as a HTK label file and you will get a 'Too many columns' error if a
line has more than 100 words on it or a ridiculously high perplexity otherwise. Check the command
documentation in chapter 17.

15.8.2 Command syntax problems 

If a tool is giving unexpected syntax errors then check that you have placed all the option switches 
before the compulsory parameters - the tools will not work if this rule is not followed. You must also 
place whitespace between switches and any options they expect. The ordering of switches is not 
important, but the order of compulsory parameters cannot be changed. Check the switch syntax - 
passing a redundant parameter to one will cause problems since it will be interpreted as the first 
compulsory parameter. 

All HTK tools assume that a parameter which starts with a digit is a number of some kind, so
you cannot pass filenames which start with a digit. This is a limitation of the routines
in HShell.

15.8.3 Word maps

If your word map and gram file combination is being rejected then make sure they match in terms 
of their sequence number. Although gram files are mainly stored in a binary format the header is 
in plain text and so if you look at the top of the file you can compare it manually with the word 
map. Note it is not a good idea to fiddle the values to match since they are bound to be different 
for a good reason! Word maps must have the same or a higher sequence id than a gram file in order 
to open that gram file - the names must match too. 

The tools might not behave as you expect. For example, LGPrep will write its word map to
the file wmap unless you tell it otherwise, irrespective of the input filename. It will also place it in
the same directory as the gram files unless you changed its name from wmap(!) - check you are
picking up the correct word map when building subsequent gram files.

The word ids start at 65536 in order to allow space for that many classes below them - anything
lower is assumed to be a class. In turn the number of classes is limited to 65535.



15.8.4 Memory problems

Should you encounter memory problems then try altering the amount of space reserved by the tools
using the relevant tool switches such as -a and -b for LGPrep and LGCopy. You could also try
turning on memory tracing to see how much memory is used and for what (use the configuration
TRACE parameters and the -T option as appropriate). Language models can become very large,
however - hundreds of megabytes in size, for example - so it is important to apply cut-offs and/or
discounting as appropriate to keep them at a suitable size for your system.

15.8.5 Unexpected perplexities

If perplexities are not what you expected, then there are many things that could have gone wrong
- you may not have constructed a suitable model - but also some mistakes you might have made.
Check that you passed all the switches you intended, and check that you have been consistent
in your use of *RAW* configuration parameters - using escaped characters in the language model
without them in your test text will lead to unexpected results. If you have not escaped words in
your word map then check they're not escaped in any class map. When using a class model make
sure you're passing the correct input file of the three separate components.

Check the switches to LPlex - did you set -u as you intended? If you passed a text file did you
pass -t? Not doing so will lead either to a format error or to extremely bizarre perplexities!

Did you build the length of n-gram you meant to? Check the final language model by looking at
the header of it, which is always stored in plain text format. You can easily see how many n-grams
there are for each size of n.

[Figure: the HTK Language Modelling tools and the library modules that support them: LWMap, LCMap, HDict, HShell, HMem and Terminal]

As noted in the introduction, building a language model with the HTK tools is a two stage
process. In the first stage, the n-gram data is accumulated and in the second stage, language
models are estimated from this data. The n-gram data consists of a set of gram files and their
associated word map. Each gram file contains a list of n-grams and their counts. Each n-gram is
represented by a sequence of n integer word indices and the word map relates each of these indices
to the actual orthographic word form. As a special case, a word map containing just words and no
indices acts as a simple word list.

In many cases, a class map is also required. Class maps give a name to a subset of words and
associate an index with that name. Once defined, a class name can be used like a regular word and,
in particular, it can be listed in a word map. In their most general form, class maps are used to
build class-based language models. However, they are also needed for most word-based language
models because they allow a class to be defined for unknown words.

This chapter includes descriptions of three basic types of data file: gram file, word map file, and
class map file. As shown in the figure, each file type is supported by a specific HTK module which
provides the required input and output and the other related operations.

Also described in this chapter is a fourth type of data file called a frequency-of-frequency or FoF
file. A FoF file contains a count of the number of times an n-gram occurred just once, twice, three
times, etc. It is used with Turing-Good discounting (although LBuild can generate FoF counts on
the fly) but it can also be used to estimate the number of n-grams which will be included in the
final language models.
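The idea behind a FoF table can be sketched in a few lines of Python. The example below builds frequency-of-frequency counts from a toy set of n-gram counts and then estimates how many n-grams survive a cut-off. The function names and data are hypothetical, and the sketch assumes that a cut-off of c discards every n-gram seen c times or fewer; check the tool documentation for the exact rule applied by LBuild.

```python
from collections import Counter


def frequency_of_frequencies(ngram_counts):
    """Map each occurrence count r to the number of distinct n-grams seen r times."""
    return Counter(ngram_counts.values())


def ngrams_kept(fof, cutoff):
    """Number of distinct n-grams left, assuming counts <= cutoff are discarded."""
    return sum(n for freq, n in fof.items() if freq > cutoff)


# Toy bigram counts: two bigrams seen once, one seen three times.
bigram_counts = {("THE", "MAN"): 1, ("MAN", "WALKED"): 1, ("OF", "THE"): 3}
fof = frequency_of_frequencies(bigram_counts)
print(dict(fof))
print(ngrams_kept(fof, 1))
```

This is why a FoF table is enough to predict model size for any cut-off without reading the gram files again.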

The various language model formats, for both word and class-based models, are also described 
in detail. 

Trace levels for each language modelling library module are also described - see the tool
documentation for details of tool level tracing options. Finally, run-time and compile-time settings are
documented.

16.1 Words and Classes

In the HTK language modelling tools, words and classes are represented internally by integer indices
in the range 0 to 2^24 - 1 (16777215). This range is chosen to fit exactly into three 8-bit bytes thereby
allowing efficient compaction of large lists of n-gram counts within gram files.

These integer indices will be referred to subsequently as ids. Class ids are limited to the range
0 to 2^16 - 1 and word ids fill the remaining range of 2^16 to 2^24 - 1. Thus, any id with a zero most
significant byte is a class id and all other ids are word ids.
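The id ranges can be checked with simple integer arithmetic. The Python sketch below classifies an id and packs it into three bytes; the function names are hypothetical, and the big-endian byte order is an assumption made for illustration only, since the byte order used inside gram files is not stated here.

```python
CLASS_ID_LIMIT = 1 << 16        # class ids are 0 .. 2^16 - 1
MAX_ID = (1 << 24) - 1          # largest id that fits in three bytes


def is_class_id(id_value: int) -> bool:
    """True if the id lies in the class range, i.e. its top byte is zero."""
    if not 0 <= id_value <= MAX_ID:
        raise ValueError("id out of the 24-bit range")
    return id_value < CLASS_ID_LIMIT


def pack_id(id_value: int) -> bytes:
    """Pack an id into exactly three bytes (big-endian chosen for illustration)."""
    if not 0 <= id_value <= MAX_ID:
        raise ValueError("id out of the 24-bit range")
    return id_value.to_bytes(3, "big")


def unpack_id(data: bytes) -> int:
    """Inverse of pack_id."""
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    return int.from_bytes(data, "big")


print(is_class_id(65535), is_class_id(65536))
print(pack_id(65536), unpack_id(pack_id(MAX_ID)))
```

With this layout the most significant byte alone tells a class id from a word id.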

In the context of word maps, the term word may refer to either an orthographic word or the
name of a class. Thus, in its most general form, a word map can contain the ids of both orthographic
words from a source text and class names defined in one or more class maps.

The mapping of orthographic words to ids is relatively permanent and normally takes place
when building gram files (using LGPrep). Each time a new word is encountered, it is allocated a
unique id. Once allocated, a word id should never be changed. Class ids, on the other hand, are
more dynamic since their definition depends on the language model being built. Finally, composite
word maps can be derived from a collection of word and class maps using the tool LSubset. These
derived word maps are typically used to define a working subset of the name space and this subset
can contain both word and class ids.

16.2 Data File Headers 






All the data files have headers containing information about the file and the associated environment.
The header is variable-size being terminated by a data symbol (e.g. \Words\, \Grams\, \FoFs\, etc)
followed by the start of the actual data.




Each header field is written on a separate line in the form

<Field> = <value>

where <Field> is the name of the field and <value> is its value. The field name is case insensitive
and zero or more spaces can surround the = sign. The <value> starts with the first printing
character and ends at the last printing character on the line. HTK style escaping is never used in
HLM headers.
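These rules are simple enough to sketch as a small parser. The Python example below reads header lines until it meets a data symbol such as \Words\; the function name is hypothetical, and the way the data symbol is recognised (a line that starts and ends with a backslash) is an assumption for illustration rather than a description of the HTK implementation.

```python
def parse_header(lines):
    """Parse '<Field> = <value>' lines up to the terminating data symbol.

    Returns (fields, data_symbol). Field names are lower-cased because
    they are case insensitive; lines without '=' are skipped.
    """
    fields = {}
    for line in lines:
        stripped = line.strip()
        # Assumed shape of a data symbol: \Name\
        if len(stripped) > 2 and stripped.startswith("\\") and stripped.endswith("\\"):
            return fields, stripped
        if "=" not in stripped:
            continue
        name, value = stripped.split("=", 1)
        fields[name.strip().lower()] = value.strip()
    return fields, None


header = ["Name = US_Business_News", "SeqNo=13", "ENTRIES= 133986", "\\Words\\", "A"]
fields, symbol = parse_header(header)
print(fields)
print(symbol)
```

Storing the field names in lower case makes later look-ups independent of how the file spelled them.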

Fields may be given in any order. Field names which are unrecognised by HTK are ignored.
Further field names may be introduced in future, but these are guaranteed not to start with the
letter "U".

(NB. The above format rules do not apply to the files described in section 16.8 - see that section
for more details)

16.3 Word Map Files

A word map file is a text file consisting of a header and a list of word entries. The header contains
the following

1. a name consisting of any printable character string (Name=sss).

2. the number of word entries (Entries=nnn)

3. whether or not word ids (IDs) and word frequency counts (WFCs) are included (Fields=ID or
Fields=ID,WFC). When the Fields field is missing, the word map contains only word names
and it degenerates to the special case of a word list.

4. a sequence number (SeqNo=nnn)

5. escaping mode (EscMode=HTK or EscMode=RAW). The default is HTK.

6. the language (Language=xxx)

7. word map source, a text string used with derived word maps to describe the source from
which the subset was derived. (Source=...).

The first two of these fields must always be included, and for word maps, the Fields field must
also be included. The remaining fields are optional. More header fields may be defined later and
the user is free to insert others.

The word entries begin with the keyword \Words\. Each word is on a separate line with the 
format 

word [id [count] ] 

where the id and count are optional. Proper word maps always have an id. When the count is
included, it denotes the number of times that the word has been encountered during the processing
of text data.

For example, a typical word map file might be

Name=US_Business_News
SeqNo=13
Entries=133986
Fields=ID,WFC
Language=American
EscMode=RAW
\Words\
<s> 65536 34850
CAN'T 65537 2087
THE 65538 12004
DOLLAR 65539 169
IS 65540 4593
...

In this example, the word map is called "US_Business_News" and it has been updated 13 times
since it was originally created. It contains a total of 133986 entries and word frequency counts are
included. The language is "American" and there is no escaping used (e.g. can't is written CAN'T
rather than the standard HTK escaped form of CAN\'T).

As noted above, when the Fields field is missing, the word map contains only the words and
serves the purpose of a simple word list. For example, a typical word list might be defined as follows

Name=US_Business_News
Entries=10000
\Words\
A
ABLE
ABOUT
...
ZOO

Word lists are used to define subsets of the words in a word map. Whenever a tool requires a word
list, a simple list of words can be input instead of the above. For example, the previous list could
be input as

A
ABLE
ABOUT
...
ZOO

In this case, the default is to assume that all input words are escaped. If raw mode input is required, 
the configuration variable INWMAPRAW should be set true (see section 4.6). 

As explained in section 4.6, by default HTK tools output word maps in HTK escaped form. 
However, this can be overridden by setting the LWMap configuration variable OUTWMAPRAW to true. 
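
As a rough illustration of the word entry format word [id [count]] described in this section, the following Python sketch parses a single entry. The function name is hypothetical, and the sketch assumes raw mode (no HTK escaping) and words that contain no white space.

```python
def parse_word_entry(line):
    """Parse one 'word [id [count]]' line into (word, id, count).

    Hypothetical helper: the id and count are returned as None when
    absent, as in a plain word list.
    """
    parts = line.split()
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"expected 1 to 3 fields, got {len(parts)}: {line!r}")
    word = parts[0]
    word_id = int(parts[1]) if len(parts) > 1 else None
    count = int(parts[2]) if len(parts) > 2 else None
    return word, word_id, count
```

A full word map entry yields all three values, whereas a word list entry such as ABLE yields only the word.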



16.4 Class Map Files 

A class map file defines one or more word classes. It has a header similar to that of a word map file,
containing values for Name, Entries, EscMode and Language. In this case, the number of entries
refers to the number of classes defined.

The class definitions are introduced by the keyword \Classes\. Each class definition has a
single line sub-header consisting of a name, an id number, the number of class members (or
non-members) and a keyword which must be IN or NOTIN. In the latter case, the class consists of all
words except those listed, i.e. the class is defined by its complement.

The following is a simple example of a class map file.

Name=Simple_Classes
Entries=97
EscMode=HTK
Language=British
\Classes\
ARTICLES 1 3 IN
A
AN
THE
COLOURS 2 4 IN
RED
BLUE
GREEN
YELLOW
SHAPES 3 6 IN
SQUARE
CIRCLE
...
etc

This class map file defines 97 distinct classes, the first of which is a class called ARTICLES (id=1)
with 3 members: (a, an, the).

For simple word-based language models, the class map file is used to define the class of unknown
words. This is usually just the complement of the vocabulary list. For example, a typical class map
file defining the unknown class !!UNKID might be

Name=Vocab_65k_V2.3
Entries=1
Language=American
EscMode=NONE
\Classes\
!!UNKID 1 65426 NOTIN
ABATE
ABLE
ABORT
ABOUND
...

Since this case is so common, the tools also allow a plain vocabulary list to be supplied in place of
a proper class map file. For example, supplying a class map file containing just

ABATE
ABLE
ABORT
ABOUND
...



would have an equivalent effect to the previous example provided that the LCMap configuration
variables UNKNOWNID and UNKNOWNNAME have been set in order to define the id and name to be used
for the unknown class. In the example given, including the following two lines in the configuration
file would have the desired effect

LCMAP: UNKNOWNID = 1
LCMAP: UNKNOWNNAME = !!UNKID

Notice the similarity with the special case of word lists described in section 16.3. A plain
word list can therefore be used to define both a vocabulary subset and the unknown class. In a
conventional language model, these are, of course, the same thing.

In a similar fashion to word maps, the input of a headerless class map can be set to raw mode
by setting the LCMap configuration variable INCMAPRAW to true, and all class maps can be output
in raw mode by setting the configuration variable OUTCMAPRAW to true.
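
The IN and NOTIN semantics of class definitions can be illustrated with a short Python sketch. The function names are hypothetical and the code is not taken from HTK; it assumes the lines following \Classes\ are supplied one per list element, with one class member per line.

```python
from itertools import islice


def parse_class_map(body_lines):
    """Parse the lines after \\Classes\\ into {name: (id, keyword, members)}."""
    classes = {}
    lines = iter(l.strip() for l in body_lines if l.strip())
    for sub_header in lines:
        name, class_id, size, keyword = sub_header.split()
        if keyword not in ("IN", "NOTIN"):
            raise ValueError(f"keyword must be IN or NOTIN: {sub_header!r}")
        # The sub-header states how many member (or non-member) lines follow.
        members = set(islice(lines, int(size)))
        if len(members) != int(size):
            raise ValueError(f"class {name} is truncated or has duplicates")
        classes[name] = (int(class_id), keyword, members)
    return classes


def in_class(classes, name, word):
    """Return True if word belongs to the named class."""
    _, keyword, members = classes[name]
    # A NOTIN class is defined by its complement.
    return (word in members) if keyword == "IN" else (word not in members)
```

With a NOTIN class such as !!UNKID, any word that is absent from the listed vocabulary is reported as a member, which is how the unknown class is obtained from a vocabulary list.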



16.5 Gram Files 

Statistical language models are estimated by counting the number of events in a sample source text.
These event counts are stored in gram files. Provided that they share a common word map, gram
files can be grouped together in arbitrary ways to form the raw data pool from which a language
model can be constructed. For example, a text source containing 100m words could be processed
and stored as two gram files. A few months later, a 3rd gram file could be generated from a newly
acquired text source. This new gram file could then be added to the original two files to build a new
language model. The original source text is not needed and the gram files need not be changed.

A gram file consists of a header followed by a sorted list of n-gram counts. The header contains
the following items, each written on a separate line

1. n-gram size, i.e. 2 for bigrams, 3 for trigrams, etc. (Ngram=N)

2. Word map. Name of word map to be used with this gram file. (WMap=wmapname)

3. First gram. The first n-gram in the file (gram1 = w1 w2 w3 ...)

4. Sequence number. If given then the actual word map must have a sequence number which is
greater than or equal to this. (SeqNo=nnn)

5. Last gram. The last n-gram in the file (gramN = w1 w2 w3 ...)

6. Number of distinct n-grams in file. (Entries = N)

7. Word map check. This is an optional field containing a word and its id. It can be included as
a double check that the correct word map is being used to interpret this gram file. The given
word is looked up in the word map and if the corresponding id does not match, an error is
reported. (WMCheck = word id)

8. Text source. This is an optional text string describing the text source which was used to
generate the gram file (Source=...).



For example, a typical gram file header might be

Ngram = 3
WMap = US_Business_News
Entries = 50345980
WMCheck = XEROX 340987
Gram1 = AN ABLE ART
GramN = ZEALOUS ZOO OWNERS
Source = WSJ Aug 94 to Dec 94


The n-grams themselves begin immediately following the line containing the keyword \Grams\
(that is, the first byte of the binary data immediately follows the newline character).
They are listed in lexicographic sort order such that for the n-gram {w1 w2 ... wN}, w1 varies the
least rapidly and wN varies the most rapidly. Each n-gram consists of a sequence of N 3-byte
word ids followed by a single 1-byte count. If the n-gram occurred more than 255 times, then it is
repeated with the counts being interpreted to the base 256. For example, if a gram file contains
the sequence

w1 w2 ... wN c1
w1 w2 ... wN c2
w1 w2 ... wN c3

corresponding to the n-gram {w1 w2 ... wN}, the corresponding count is

c1 + c2 * 256 + c3 * 256^2
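
The base-256 counting scheme can be made concrete with a small Python sketch. It is an illustration only: the function name is hypothetical, and the byte order of the 3-byte word ids is not stated in this section, so big-endian is assumed here purely for the sake of the example.

```python
def read_gram_counts(data, n):
    """Decode packed n-gram records into {word-id tuple: count}.

    Each record is n 3-byte word ids followed by a 1-byte count.
    Repeated records for the same n-gram supply successive base-256
    digits of the count, least significant first.
    Assumption: word ids are stored big-endian.
    """
    record_size = 3 * n + 1
    if len(data) % record_size:
        raise ValueError("data length is not a whole number of records")
    counts = {}
    digits_seen = {}
    for off in range(0, len(data), record_size):
        ids = tuple(
            int.from_bytes(data[off + 3 * k: off + 3 * k + 3], "big")
            for k in range(n)
        )
        digit = data[off + 3 * n]
        position = digits_seen.get(ids, 0)
        counts[ids] = counts.get(ids, 0) + digit * 256 ** position
        digits_seen[ids] = position + 1
    return counts
```

For instance, a bigram that occurred 300 times would be stored as two records with counts 44 and 1, since 44 + 1 * 256 = 300.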

When a group of gram files are used as input to a tool, they must be organised so that the
tool receives n-grams as a single stream in sort order, i.e. as far as the tool is concerned, the net
effect must be as if there is just a single gram file. Of course, a sufficient approach would be to
open all input gram files in parallel and then scan them as needed to extract the required sorted
n-gram sequence. However, if two n-gram files were organised such that the last n-gram in one file
was ordered before the first n-gram of the second file, it would be much more efficient to open and
read the files in sequence. Files such as these are said to be sequenced and in general, HTK tools
are supplied with a mix of sequenced and non-sequenced files. To optimise input in this general
case, all HTK tools which input gram files start by scanning the header fields gram1 and gramN.
This information allows a sequence table to be constructed which determines the order in which
the constituent gram files must be opened and closed. This sequence table is designed to minimise
the number of individual gram files which must be kept open in parallel.

This gram file sequencing is invisible to the HTK user, but it is important to be aware of it.
When a large number of gram files are accumulated to form a frequently used database, it may be
worth copying the gram files using LGCopy. This will have the effect of transforming the gram files
into a fully sequenced set thus ensuring that subsequent reading of the data is maximally efficient.
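
The benefit of sequencing can be illustrated by counting how many files would have to be open at once, given only the gram1 and gramN values from each header. The Python sketch below is a simple sweep over those ranges; it is not the sequence table algorithm used by HTK, the function name is hypothetical, and the first and last n-grams are assumed to have been converted into tuples that compare in gram file sort order.

```python
def max_parallel_files(ranges):
    """Return the peak number of gram files that overlap in sort order.

    ranges is a list of (gram1, gramN) pairs, one per file. A fully
    sequenced set gives 1; overlapping files give a larger value.
    """
    events = []
    for first, last in ranges:
        # Kind 0 (open) sorts before kind 1 (close), so a file that starts
        # exactly where another ends is treated as overlapping it.
        events.append((first, 0))
        events.append((last, 1))
    events.sort()
    open_now = peak = 0
    for _, kind in events:
        if kind == 0:
            open_now += 1
            peak = max(peak, open_now)
        else:
            open_now -= 1
    return peak
```

A result of 1 indicates that the files can be read one after another, which is the situation the copying step described above is intended to produce.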

16.6 Frequency-of-frequency (FoF) Files

A FoF file contains a list of the number of times that an n-gram occurred just once, twice, three
times, ..., n times. Its format is similar to a word map file. The header contains the following
information

1. n-gram size, i.e. 2 for bigrams, 3 for trigrams, etc. (Ngram=N)

2. the number of frequencies counted, i.e. the number of rows in the FoF table (Entries=nnn)

3. Text source. This is an optional text string describing the text source which was used to
generate the gram files used to compute this FoF file. (Source=...).

More header fields may be defined later and the user is free to insert others.

The data part starts with the keyword \FoFs\. Each row contains a list of the number of unigrams,
bigrams, ..., n-grams occurring exactly k times, where k is the number of the row of the table - the
first row shows the number of n-grams occurring exactly 1 time, for example.

As an example, the following is a FoF file computed from a set of trigram gram files.

Ngram = 3
Entries = 100
Source = WSJ Aug 94 to Dec 94
\FoFs\
1020 23458 78654
904 19864 56089
...

FoF files are generated by the tool LFoF. This tool will also output a list containing an estimate of
the number of n-grams that will occur in a language model for a given cut-off - set the configuration
parameter LPCALC: TRACE = 3.
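
To make the meaning of a FoF table concrete, the following Python sketch computes one column of such a table from a dictionary of n-gram counts. It only illustrates the definition; it is not how LFoF is implemented, and the function name is hypothetical.

```python
from collections import Counter


def fof_column(ngram_counts, rows):
    """Return the number of n-grams seen exactly k times, for k = 1..rows.

    ngram_counts maps each distinct n-gram (of a single order) to its
    occurrence count.
    """
    freq_of_freq = Counter(ngram_counts.values())
    return [freq_of_freq.get(k, 0) for k in range(1, rows + 1)]
```

Repeating this for unigrams, bigrams and trigrams and placing the results side by side would give rows with the same layout as the example above.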



16.7 Word LM file formats 

Language models can be stored on disk in three different file formats - text, binary and ultra. The
text format is the standard ARPA-MIT format used to distribute pre-computed language models.
The binary format is a proprietary file format which is optimised for flexibility and memory usage.
All tools will output models in this format unless instructed otherwise. The ultra LM format is a
further development of the binary LM format optimised for fast loading times and small memory
footprint. At the same time, models stored in this format cannot be pruned further in terms of size
and vocabulary.

16.7.1 The ARPA-MIT LM format

This format for storing n-gram back-off language models is defined as follows

<LM_definition> = [ { <comment> } ]
                  \data\
                  <header>
                  <body>
                  \end\
<comment> = { <word> }

An ARPA-style language model file comes in two parts - the header and the n-gram definitions.
The header contains a description of the contents of the file.

<header> = { ngram <int>=<int> }

The first <int> gives the n-gram order and the second <int> gives the number of n-gram entries
stored.

For example, a trigram language model consists of three sections - the unigram, bigram and
trigram sections respectively. The corresponding entry in the header indicates the number of entries
for that section. This can be used to aid the loading-in procedure. The body part contains all
sections of the language model and is defined as follows:

<body> = { <lmpart1> } <lmpart2>
<lmpart1> = \<int>-grams:
            { <ngramdef1> }
<lmpart2> = \<int>-grams:
            { <ngramdef2> }
<ngramdef1> = <float> { <word> } <float>
<ngramdef2> = <float> { <word> }

Each n-gram definition starts with a probability value stored as log10 followed by a sequence of n
words describing the actual n-gram. In all sections except the last one this is followed by a
back-off weight which is also stored as log10. The following example shows an extract from a trigram
language model stored in the ARPA-text format.

\data\
ngram 1=19979
ngram 2=4987955
ngram 3=6136155

\1-grams:
-1.6682 A -2.2371
-5.5975 A'S -0.2818
-2.8755 A. -1.1409
-4.3297 A.'S -0.5886
-5.1432 A.S -0.4862
...

\2-grams:
-3.4627 A BABY -0.2884
-4.8091 A BABY'S -0.1659
-5.4763 A BACH -0.4722
-3.6622 A BACK -0.8814
...

\3-grams:
-4.3813 !SENT_START A CAMBRIDGE
-4.4782 !SENT_START A CAMEL
-4.0196 !SENT_START A CAMERA
-4.9004 !SENT_START A CAMP
-3.4319 !SENT_START A CAMPAIGN
...

\end\

Component   # with back-off weights   Total
unigram     65,467                    65,467
bigram      2,074,422                 6,224,660
trigram     4,485,738                 9,745,297
fourgram    0                         9,946,193

Table 16.1: Component statistics for a 65k word fourgram language model with cut-offs: bigram 1,
trigram 2, fourgram 2.
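
To show how the probabilities and back-off weights in an ARPA-MIT file are conventionally combined, the following Python sketch applies the usual back-off recursion: if an n-gram has no explicit entry, the back-off weight of its history is added to the log probability of the shortened n-gram. This is a sketch of the standard interpretation rather than of HTK's own code; the function name and the in-memory dictionary layout are hypothetical, and a missing back-off weight is assumed to be 0.0 in the log domain.

```python
def log10_prob(lm, words):
    """Return log10 P(last word | preceding words) using back-off.

    lm maps word tuples to (log10 probability, log10 back-off weight),
    where the weight may be None for entries that have none.
    """
    words = tuple(words)
    if words in lm:
        return lm[words][0]
    if len(words) == 1:
        raise KeyError(f"word not in unigram section: {words[0]!r}")
    history = words[:-1]
    backoff = 0.0
    if history in lm and lm[history][1] is not None:
        backoff = lm[history][1]
    # Drop the oldest word and retry with the lower-order model.
    return backoff + log10_prob(lm, words[1:])
```

In the extract above, the bigram A BABY has an explicit entry and so its probability is read directly, whereas a bigram without an entry would use the back-off weight of its first word plus the unigram probability of its second.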



16.7.2 The modified ARPA-MIT format

The efficient loading of the language model file requires prior information as to memory
requirements. Such information is partially available from the header of the file which shows how many
entries will be found in each section of the model. From the back-off nature of the language model
it is clear that the back-off weight associated with an n-gram (w_1, w_2, ..., w_{n-1}) is only useful
when p(w_n | w_1, w_2, ..., w_{n-1}) is an explicit entry in the file or computed via backing-off to the
corresponding (n-1)-grams. In other words, the presence of a back-off weight associated with the
n-gram w_1, w_2, ..., w_{n-1} can be used to indicate the existence of explicit n-grams w_1, w_2, ..., w_n.
The use of such information can greatly reduce the storage requirements of the language model
since the back-off weight requires extra storage. For example, considering the statistics shown in
table 16.1, such selective memory allocation can result in dramatic savings. This information is
accommodated by modifying the syntax and semantics of the rule

<ngramdef1> = <float> { <word> } [ <float> ]

whereby a back-off weight associated with n-gram (w_1, w_2, ..., w_{n-1}) indicates the existence of
n-grams (w_1, w_2, ..., w_n). This version will be referred to as the modified ARPA-text format.

16.7.3 The binary LM format

This format is the binary version of the modified ARPA-text format. It was designed to be a compact,
self-contained format which aids the fast loading of large language model files. The format is similar
to the original ARPA-text format with the following modification

<header> = { (ngram <int>=<int>) | (ngram <int>~<int>) }

The first alternative in the rule describes a section stored as text, the second one describes a section
stored in binary. The unigram section of a language model file is always stored as text.

<ngramdef> = <txtgram> | <bingram>
<txtgram> = <float> { <word> } [ <float> ]
<bingram> = <f_type> <f_size> <f_float> { <f_word> } [ <f_float> ]

In the above definition, <f_type> is a 1-byte flags field, <f_size> is a 1-byte unsigned number
indicating the total size in bytes of the remaining fields, <f_float> is a 4-byte field for the n-gram
probability, <f_word> is a numeric word id, and the last <f_float> is the back-off weight. The
numeric word identifier is an unsigned integer assigned to each word in the order of occurrence of
the words in the unigram section. The minimum size of this field is 2 bytes as used in vocabulary
lists with up to 65,535 words. If this number is exceeded the field size is automatically extended to
accommodate all words. The size of the fields used to store the probability and back-off weight are
typically 4 bytes, however this may vary on different computer architectures. The least significant
bit of the flags field indicates the presence/absence of a back-off weight with corresponding values
1/0. The remaining bits of the flags field are not used at present.
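
The layout of a binary n-gram record can be sketched with Python's struct module. The field order follows the <bingram> rule above, but several details are assumptions made only for this illustration: little-endian byte order, 4-byte IEEE floats and 2-byte word ids. The function names are hypothetical and the code is not derived from the HTK sources.

```python
import struct


def pack_bingram(log_prob, word_ids, backoff=None):
    """Pack one n-gram as flags, size, probability, word ids, [back-off]."""
    flags = 1 if backoff is not None else 0  # least significant bit set
    payload = struct.pack("<f", log_prob)
    payload += b"".join(struct.pack("<H", w) for w in word_ids)
    if backoff is not None:
        payload += struct.pack("<f", backoff)
    # The size byte counts only the fields that follow it.
    return struct.pack("<BB", flags, len(payload)) + payload


def unpack_bingram(buf, n):
    """Unpack a record written by pack_bingram for an n-gram of order n."""
    flags, size = struct.unpack_from("<BB", buf, 0)
    expected = 4 + 2 * n + (4 if flags & 1 else 0)
    if size != expected:
        raise ValueError(f"size byte {size} does not match expected {expected}")
    (log_prob,) = struct.unpack_from("<f", buf, 2)
    word_ids = struct.unpack_from(f"<{n}H", buf, 6)
    backoff = None
    if flags & 1:
        (backoff,) = struct.unpack_from("<f", buf, 6 + 2 * n)
    return log_prob, word_ids, backoff
```

Under these assumptions a bigram with a back-off weight occupies 14 bytes and one without occupies 10 bytes, which illustrates why omitting unused back-off weights saves space.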



16.8 Class LM file formats 

Class language models replace the word language model described in section 16.7 with an identical
component which models class n-grams instead of word n-grams. They add to this a second
component which includes the deterministic word-to-class mapping with associated word-given-class
probabilities, expressed either as counts (which are normalised to probabilities on loading) or as
explicit natural log probabilities. These two components are then either combined into a single file
or are pointed to with a special link file.

16.8.1 Class counts format

The format of a word-given-class counts file, as generated using the -q option from Cluster, is as
follows:

Word|Class counts
[blank line]
Derived from: <file>
Number of classes: <int>
Number of words: <int>
Iterations: <int>
[blank line]
Word Class name Count

followed by one line for each word in the model of the form:

<word> CLASS<int> <int>

The fields are mostly self-explanatory. The Iterations: header is for information only and
records how many iterations had been performed to produce the classmap contained within the
file, and the Derived from: header is similarly also for display purposes only. Any number of
headers may be present; the header section is terminated by finding a line beginning with the four
characters making up Word. The colon-terminated headers may be in any order.

CLASS<int> must be the name of a class in the classmap (technically actually the wordmap)
used to build the class-given-class history n-gram component of the language model - the file built
by LBuild. In the current implementation these class names are restricted to being of the form
CLASS<int>, although a modification to the code in LModel.c would allow this restriction to be
removed. Each line after the header specifies the count of each word and the class it is in, so for
example

THE CLASS73 1859

would specify that the word THE was in class CLASS73 and occurred 1859 times.
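
The normalisation of counts to probabilities on loading can be sketched as follows. This Python example assumes the natural estimate, in which each word count is divided by the total count of its class, and it assumes every count is positive. The function name is hypothetical and the exact procedure used by HTK is not described in this section.

```python
import math
from collections import defaultdict


def counts_to_log_probs(entries):
    """Convert (word, class_name, count) entries to natural log probabilities.

    Returns {word: (class_name, ln P(word | class))}.
    Assumption: P(word | class) = count(word) / total count of the class.
    """
    entries = list(entries)
    class_totals = defaultdict(int)
    for _, class_name, count in entries:
        class_totals[class_name] += count
    return {
        word: (class_name, math.log(count / class_totals[class_name]))
        for word, class_name, count in entries
    }
```

A class containing a single word gives that word a log probability of 0.0, since the word accounts for all of the class count.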

16.8.2 The class probabilities format

The format of a word-given-class probabilities file, as generated using the -p option from Cluster,
is very similar to that of the counts file described in the previous sub-section, and is as follows:

Word|Class probabilities
[blank line]
Derived from: <file>
Number of classes: <int>
Number of words: <int>
Iterations: <int>
[blank line]
Word Class name Probability (log)

followed by one line for each word in the model of the form:

<word> CLASS<int> <float>

As in the previous section, the fields are mostly self-explanatory. The Iterations: header is
for information only and records how many iterations had been performed to produce the classmap
contained within the file, and the Derived from: header is similarly also for display purposes only.
Any number of headers may be present; the header section is terminated by finding a line beginning
with the four characters making up Word. The colon-terminated headers may be in any order.

CLASS<int> must be the name of a class in the classmap (technically actually the wordmap)
used to build the class-given-class history n-gram component of the language model - the file built
by LBuild. In the current implementation these class names are restricted to being of the form
CLASS<int>, although a modification to the code in LModel.c would allow this restriction to be
removed. Each <float> specifies the natural logarithm of the probability of the word given the
class, or -99.9900 if the probability of the word is less than 1.0 x 10^-20.



16.8.3 The class LM three file format

A special class language model file, generated by LLink, links together either the word-given-class
probability or count files described above (either can be used to give the same results) with a
class-given-class history n-gram file constructed using LBuild. It is a simple text file which specifies
the filename of the two relevant components:

Class-based LM
Word|Class counts: <file> or Word|Class probabilities: <file>
Class|Class grams: <file>

The second line must state counts or probabilities as appropriate for the relevant file.



16.8.4 The class LM single file format 

An alternative to the class language model file described in section 16.8.3 is the composite single-file
class language model produced by LLink -s - this does not require the two component files to
be present since it integrates them into a single file. The format of this resulting file is as follows:

CLASS MODEL
Word|Class <string: counts/probs>

Derived from: <file>
Number of classes: <int>
Number of words: <int>
Iterations: <int>

Class n-gram counts follow; word|class component is at end of file.

The second line must state either counts or probabilities as appropriate for the relevant
component file used when constructing this composite file. The fields are mostly self-explanatory. The
Iterations: header is for information only and records how many iterations had been performed
to produce the classmap contained within the file, and the Derived from: header is similarly also
for display purposes only. Any number of headers may be present; the header section is terminated
by finding a line beginning with the five characters making up Class. The colon-terminated headers
may be in any order.

The class-given-classes n-gram component of the model then follows immediately in any of the
formats supported by word n-gram language models, i.e. those described in section 16.7. No blank
lines are expected between the header shown above and the included model, although they may be
supported by the embedded model.

Immediately following the class-given-classes n-gram component follows the body of the
word-given-class probabilities or counts file as described in sections 16.8.1 and 16.8.2 above. That is, the
remainder of the file consists of lines of the form:

<word> CLASS<int> <float/int>

One line is expected for each word as specified in the header at the top of the file. Integer word
counts should be provided in the final field for each word in the case of a counts file, or
word-given-class probabilities if a probabilities file - as specified by the second line of the overall file. In the
latter case each <float> specifies the natural logarithm of the probability of the word given the
class, or -99.9900 if the probability of the word is less than 1.0 x 10^-20.

CLASS<int> must be the name of a class in the classmap (technically actually the wordmap)
used to build the class-given-class history n-gram component of the language model - the file built
by LBuild. In the current implementation these class names are restricted to being of the form
CLASS<int>, although a modification to the code in LModel.c would allow this restriction to be
removed.

16.9 Language modelling tracing 

Each of the HTK language modelling tools provides its own trace facilities, as documented with the 
relevant tool in chapter 17. The standard libraries also provide their own trace settings, which can 
be set in a passed configuration file. Each of the supported trace levels is documented below with 
the octal value necessary to enable it. 
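
Since each trace level in this section is given as a distinct octal bit value, several levels can in principle be enabled together by combining their values. The Python sketch below shows the arithmetic, under the assumption that the levels are bit flags combined with a bitwise OR; the function name is hypothetical.

```python
def trace_value(*levels):
    """Combine octal trace levels into one value, shown as 4 octal digits."""
    value = 0
    for level in levels:
        value |= level  # assumption: levels are independent bit flags
    return format(value, "04o")
```

For example, combining levels 0001, 0004 and 0020 would give the octal value 0025, which could then be assigned to TRACE for the relevant module in a configuration file.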



16.9.1 LCMap

• 0001 Top level tracing

• 0002 Class map loading

16.9.2 LGBase

• 0001 Top level tracing

• 0002 Trace n-gram squashing

• 0004 Trace n-gram buffer sorting

• 0010 Display n-gram input set tree

• 0020 Display maximum parallel input streams

• 0040 Trace parallel input streaming

• 0100 Display information input / output

16.9.3 LModel

• 0001 Top level tracing

• 0002 Trace loading of language models

• 0004 Trace saving of language models

• 0010 Trace word mappings

• 0020 Trace n-gram lookup

16.9.4 LPCalc

• 0001 Top level tracing

• 0002 FoF table tracing

16.9.5 LPMerge

• 0001 Top level tracing

16.9.6 LUtil

• 0001 Top level tracing

• 0002 Show header processing

• 0004 Hash table tracing

16.9.7 LWMap

• 0001 Top level tracing

• 0002 Trace word map loading

• 0004 Trace word map sorting

16.10 Run-time configuration parameters

Section 4.10 lists the major standard HTK configuration parameter options, the rest of chapter
4 describes the general HTK environment and how to set those configuration parameters, and
chapter 18 provides a comprehensive list. For ease of reference those parameters specifically relevant
to the language modelling tools are reproduced in table 16.1.



Module   Name               Description
HShell   ABORTONERR         Core dump on error (for debugging)
HShell   HLANGMODFILTER     Filter for language model file input
HShell   HLABELFILTER       Filter for Label file input
HShell   HDICTFILTER        Filter for Dictionary file input
HShell   LGRAMFILTER        Filter for gram file input
HShell   LWMAPFILTER        Filter for word map file input
HShell   LCMAPFILTER        Filter for class map file input
HShell   HLANGMODOFILTER    Filter for language model file output
HShell   HLABELOFILTER      Filter for Label file output
HShell   HDICTOFILTER       Filter for Dictionary file output
HShell   LGRAMOFILTER       Filter for gram file output
HShell   LWMAPOFILTER       Filter for word map file output
HShell   LCMAPOFILTER       Filter for class map file output
HShell   MAXTRYOPEN         Number of file open retries
HShell   NONUMESCAPES       Prevent string output using \012 format
HShell   NATURALREADORDER   Enable natural read order for HTK binary files
HShell   NATURALWRITEORDER  Enable natural write order for HTK binary files
HMem     PROTECTSTAKS       Warn if stack is cut-back (debugging)
         TRACE              Trace control (default=0)
         STARTWORD          Set sentence start symbol (<s>)
         ENDWORD            Set sentence end symbol (</s>)
         UNKNOWNNAME        Set OOV class symbol (!!UNK)
         RAWMITFORMAT       Disable HTK escaping for LM tools
LWMap    INWMAPRAW          Disable HTK escaping for input word lists and maps
LWMap    OUTWMAPRAW         Disable HTK escaping for output word lists and maps
LCMap    INCMAPRAW          Disable HTK escaping for input class lists and maps
LCMap    OUTCMAPRAW         Disable HTK escaping for output class lists and maps
LCMap    UNKNOWNID          Set unknown symbol class ID (1)
LCMap    USEINTID           Use 4 byte ID fields to save binary models (see section 16.10.1)
LPCalc   UNIFLOOR           Unigram floor count (1)
LPCalc   KRANGE             Good-Turing discounting range (7)
LPCalc   nG_CUTOFF          n-gram cutoff (eg. 2G_CUTOFF) (1)
LPCalc   DCTYPE             Discounting type (TG for Turing-Good or ABS for Absolute) (TG)
LGBase   CHECKORDER         Check N-gram ordering in files

Table. 16.1 Configuration Parameters used in Operating Environment



16.10.1 USEINTID 

Setting this to T as opposed to its default of F forces the LModel library to save language models
using an unsigned int for each word ID as opposed to the default of an unsigned short. In most
systems these lengths correspond to 4-byte and 2-byte fields respectively. Note that if you do not
set this, LModel will automatically choose an int field size if the short field is too small - the
exception to this is if you have compiled with LM_ID_SHORT which limits the field size to an unsigned
short, in which case the tool will be forced to abort; see section 16.11.1 below.



16.11 Compile-time configuration parameters 

There are some compile-time switches which may be set when building the language modelling 
library and tools. 



16.11.1 LM_ID_SHORT

When compiling the HTK language modelling library, setting LM_ID_SHORT (for example by passing
-D LM_ID_SHORT to the compiler) forces the compiler to use an unsigned short for each language
model ID it stores, as opposed to the default of an unsigned int - in most systems this will result
in either a 2-byte integer or a 4-byte integer respectively. If you set this then you must ensure you
also set LM_ID_SHORT when compiling the HTK language modelling tools too, otherwise you will
encounter a mismatch leading to strange results! (Your compiler may warn of this error, however).
For this reason it is safest to set LM_ID_SHORT via a #define in LModel.h. You might want to set
this if you know how many distinct word ids you require and you do not want to waste memory,
although on some systems using shorts can actually be slower than using a full-size int.

Note that the run-time USEINTID parameter described in section 16.10.1 above only affects the
size of ID fields when saving a binary model from LModel, so is independent of LM_ID_SHORT. The
only restriction is that you cannot load or save a model with more ids than can fit into an unsigned
short when LM_ID_SHORT is set - the tools will abort with an error should you try this.

16.11.2 LM_COMPACT

When LM_COMPACT is defined at compile time, when a language model is loaded then its probabilities
are compressed into an unsigned short as opposed to being loaded into a float. The exact size of
these types depends on your processor architecture, but in general an unsigned short is about
half the size of a float. Using the compact storage type therefore significantly reduces the accuracy
with which probabilities are stored.

The side effect of setting this is therefore reduced accuracy when running a language model,
such as when using LPlex; or a loss of accuracy when rebuilding from an existing language model
using LMerge, LAdapt, LBuild or HLMCopy.

16.11.3 LMPROB_SHORT

Setting LMPROB_SHORT causes language model probabilities to be stored and loaded using a short
type. Unlike LM_COMPACT, this option certainly does affect the writing of language model files. If
you save a file using this format then you must ensure you reload it in the same way to ensure you
obtain sensible results.

16.11.4 INTERPOLATE_MAX 

If the library and tools are compiled with INTERPOLATE_MAX then language model interpolation
in LPlex and the LPMerge library (which is used by LAdapt and LMerge) will ignore the
individual model weights and always pick the highest probability from each of the models at any
given point. Note that this option will not normalise the models.
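
The consequence of not normalising can be seen in a small Python sketch. It contrasts a weighted sum, assumed here to represent ordinary interpolation, with simply taking the maximum. The function names and the toy probabilities are hypothetical and serve only to illustrate the point.

```python
def interpolate_weighted(probs, weights):
    """Weighted sum of the component model probabilities (assumed default)."""
    return sum(w * p for w, p in zip(weights, probs))


def interpolate_max(probs):
    """Ignore the weights and take the highest component probability."""
    return max(probs)


# Two toy models over the two-word vocabulary {a, b} (hypothetical values).
model_1 = {"a": 0.9, "b": 0.1}
model_2 = {"a": 0.2, "b": 0.8}

weighted_total = sum(
    interpolate_weighted([model_1[w], model_2[w]], [0.5, 0.5]) for w in "ab"
)
max_total = sum(interpolate_max([model_1[w], model_2[w]]) for w in "ab")
```

In this toy case the weighted values sum to 1.0 over the vocabulary, whereas the maxima sum to 1.7, so the result of taking the maximum is no longer a proper probability distribution.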



16.11.5 SANITY 



Turning on SANITY when compiling the library will add a word map check to LGBase and some 
sanity checks to LPCalc. 



16.11.6 INTEGRITY.CHECK 

Compiling with INTEGRITY_CHECK will add run-time integrity checking to the Cluster tool. Specifically
it will check that the class counts have not become corrupted and that all maximum likelihood
move updates have been correctly calculated. You should not need to enable this unless you suspect
a major tool problem, and doing so will slow down the tool execution. It could prove useful if you
wanted to adapt the way the clustering works, however.

Chapter 17

The HTK Tools






17.1 Cluster 
17.1.1 Function 

This program is used to statistically cluster words into deterministic classes. The main purpose of 
Cluster is to optimise a class map on the basis of the training text likelihood, although it can 
also import an existing class map and generate one of the files necessary for creating a class-based 
language model from the HTK language modelling tools. 

Class-based language models use a reduced number of classes relative to the number of words,
with each class containing one or more words, to allow a language model to be able to generalise to
unseen training contexts. Class-based models also typically require less training text to produce a
well-trained model than a similar complexity word model, and are often more compact due to the
much reduced number of possible distinct history contexts that can be encountered in the training
data.

Cluster takes as input a set of one or more training text gram files, which may optionally be
weighted on input, and their associated word map. It then clusters the words in the word map
into classes using a bigram likelihood measure. Due to the computational complexity of this task a
sub-optimal greedy algorithm is used, but multiple iterations of this algorithm may be performed
in order to further refine the class map, although at some point a local maximum will be reached
where the class map will not change further.[1] In practice as few as two iterations may be perfectly
adequate, even with large training data sets.

The algorithm works by considering each word in the vocabulary in turn and calculating the
change in bigram training text likelihood if the word was moved from its default class (see below)
to each other class in turn. The word is then moved to the class which increases the likelihood the
most, or it is left in its current class if no such increase is found. Each iteration of the algorithm
considers each word exactly once. Because this can be a slow process, with typical execution times
measured in terms of a few hours, not a few minutes, the Cluster tool also allows recovery files to
be written at regular intervals, which contain the current class map part-way through an iteration
along with associated files detailing at what point in the iteration the class map was exported.
These files are not essential for operation, but might be desirable if there is a risk of a long-running
process being killed via some external influence. During the execution of an iteration the tool claims
no new memory,[2] so it cannot crash in the middle of an iteration due to a lack of memory (it can,
however, fail to start an iteration in the first place).

Before beginning an iteration, Cluster places each word either into a default class or one
specified via the -l, import classmap, or -x, use recovery, options. The default distribution, given
m classes, is to place the most frequent (m - 1) words into singleton classes and then the remainder
into the remaining class. Cluster allows words to be considered in either decreasing frequency of
occurrence order, or the order they are encountered in the word map. The popular choice is to use
the former method, although in experiments it was found that the more random second approach
typically gave better class maps after fewer iterations in practice.[3] The -w option specifies this
choice.
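The default class assignment and a single iteration of the greedy procedure can be sketched as follows. This is a minimal illustration rather than the implementation used by Cluster: it recomputes the full likelihood for every candidate move instead of updating it incrementally, it ignores the special sentence start, sentence end and unknown word classes, and the class bigram factorisation P(w|v) = P(w|G(w)) P(G(w)|G(v)) together with all function names are assumptions.

```python
import math
from collections import Counter


def initial_classes(word_freq, m):
    """Put the m-1 most frequent words in singleton classes, the rest in class m-1."""
    ranked = sorted(word_freq, key=lambda w: (-word_freq[w], w))
    return {w: min(i, m - 1) for i, w in enumerate(ranked)}


def log_likelihood(bigrams, cls):
    """Training text log likelihood under P(w|v) = P(w|G(w)) * P(G(w)|G(v))."""
    word_c, succ_c, pred_c, pair_c = Counter(), Counter(), Counter(), Counter()
    for (v, w), n in bigrams.items():
        word_c[w] += n
        succ_c[cls[w]] += n
        pred_c[cls[v]] += n
        pair_c[cls[v], cls[w]] += n
    total = 0.0
    for (v, w), n in bigrams.items():
        p_word = word_c[w] / succ_c[cls[w]]
        p_class = pair_c[cls[v], cls[w]] / pred_c[cls[v]]
        total += n * (math.log(p_word) + math.log(p_class))
    return total


def iterate(bigrams, cls, order, m):
    """One iteration: consider each word once and move it only if that helps."""
    for w in order:
        best_class, best_ll = cls[w], log_likelihood(bigrams, cls)
        for c in range(m):
            if c == cls[w]:
                continue
            trial = dict(cls)
            trial[w] = c
            ll = log_likelihood(bigrams, trial)
            if ll > best_ll:
                best_class, best_ll = c, ll
        cls[w] = best_class
    return cls


text = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
bigrams = Counter(zip(text, text[1:]))
word_freq = Counter(text)
m = 3
cls = initial_classes(word_freq, m)
before = log_likelihood(bigrams, cls)
cls = iterate(bigrams, cls, list(cls), m)   # words visited in frequency order
after = log_likelihood(bigrams, cls)
```

Because a word is moved only when the likelihood strictly increases, the likelihood after an iteration can never be lower than before it, which is why repeated iterations eventually reach a local maximum.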

During execution Cluster will always write a logfile describing the changes it makes to the
classmap, unless you explicitly disable this using the -n option. If the -v switch is used then this
logfile is written in explicit English, allowing you to easily trace the execution of the clusterer;
without -v then similar information is exported in a more compact format.

Two or three special classes are also defined. The sentence start and sentence end word tokens
are always kept in singleton classes, and optionally the unknown word token can be kept in a
singleton class too - pass the -k option.[4] These tokens are placed in these classes on initialisation
and no moves to or from these classes are ever considered.






Language model files are built using either the -p or -q options, which are effectively equivalent
if using the HTK language modelling tools as black boxes. The former creates a word-given-class
probabilities file, whilst the latter stores word counts and lets the language model code itself
calculate the same probabilities.



[1] On a 65,000 word vocabulary test set with 170 million words of training text this was found to occur after around
45 iterations.

[2] Other than a few small local variables taken from the stack as functions are called.

[3] Note that these schemes are approximately similar, since the most frequent words are most likely to be encountered
sooner in the training text and thus occur higher up in the word map.

[4] The author always uses this option but has not empirically tested its efficaciousness.

17.1.2 Use

Cluster is invoked by the command line 

Cluster [options] mapfile [mult] gramfile [[mult] gramfile ...]

The given word map is loaded and then each of the specified gram files is imported. The list of
input gram files can be interspersed with multipliers. These are floating-point format numbers
which must begin with a plus or minus character (e.g. +1.0, -0.5, etc.). The effect of a multiplier
mult is to scale the n-gram counts in the following gram files by the factor mult. The resulting
scaled counts are rounded to the nearest integer when actually used in the clustering algorithm. A
multiplier stays in effect until it is redefined.
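The way multipliers apply to the gram files that follow them can be sketched as below. The function names, the initial multiplier of 1.0 and the treatment of exact halves when rounding are assumptions made for illustration; the text above states only that counts are rounded to the nearest integer.

```python
import math


def parse_gram_args(args):
    """Pair each gram file with the multiplier in effect when it is listed."""
    mult = 1.0                        # assumed initial multiplier
    files = []
    for arg in args:
        if arg[0] in "+-":            # multipliers must start with + or -
            mult = float(arg)
        else:
            files.append((arg, mult))
    return files


def scale_count(count, mult):
    """Scale an n-gram count and round to the nearest integer (halves go up)."""
    return int(math.floor(count * mult + 0.5))


pairs = parse_gram_args(["a.gram", "+2.0", "b.gram", "c.gram", "-0.5", "d.gram"])
```

Here b.gram and c.gram both receive the multiplier +2.0 because a multiplier remains in effect until the next one is given.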

The allowable options to Cluster are as follows

-c n Use n classes. This specifies the number of classes that should be in the resultant class map.

-i n Perform n iterations. This specifies the number of iterations of the clustering algorithm that
should be performed. (If you are using the -x option then completing the current iteration does not
count towards the total number, so use -i 0 to complete it and then finish.)

-k Keep the special unknown word token in its own singleton class. If not passed it can be moved
to or from any class.

-l fn Load the classmap fn at start up and when performing any further iterations do so from
this starting point.

-m Record the running value of the maximum likelihood function used by the clusterer to
optimise the training text likelihood in the log file. This option is principally provided for
debugging purposes.

-n Do not write any log file during execution of an iteration.

-o fn Specify the prefix of all output files. All output class map, logfile and recovery files share
the same filename prefix, and this is specified via the -o switch. The default is cluster.

-p fn Write a word-given-class probabilities file. Either this or the -q switch is required to
actually build a class-based language model. The language model library, LModel,
supports both probability and count-based class files. There is no difference in use, although
each allows different types of manual manipulation of the file. Note that if you do not pass
-p or -q you may run Cluster at a later date using the -l and -i 0 options to just produce
a language model file.

-q fn Write a word-given-class counts file. See the documentation for -p.

-r n Write recovery files after moving n words since the previous recovery file was written or an
iteration began. Pass -r 0 to disable writing of recovery files.

-s tkn Specify the sentence start token.

-t tkn Specify the sentence end token.

-u tkn Specify the unknown word token.

-v Use verbose log file format.

-w [WMAP/FREQ] Specify the order in which word moves are considered. Default is WMAP in which
words are considered in the order they are encountered in the word map. Specifying FREQ will
consider the most frequent word first and then the remainder in decreasing order of frequency.

-x fn Continue execution from recovery file fn.

Cluster also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.

17.1.3 Tracing



Cluster supports the following trace options, where each trace flag is given using an octal base:

00001 basic progress reporting.

00002 report major file operations - good for following start-up.

00004 more detailed progress reporting.

00010 trace memory usage during execution and at end.

Trace flags are set using the -T option or the TRACE configuration variable.
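Since each flag occupies its own bit, several trace options can be requested at once by combining their octal values. The sketch below shows only the arithmetic; the dictionary and function names are hypothetical, and how the combined value should be written on the command line is not shown here.

```python
CLUSTER_TRACE = {
    0o00001: "basic progress reporting",
    0o00002: "major file operations",
    0o00004: "more detailed progress reporting",
    0o00010: "memory usage",
}


def combine(*flags):
    """Bitwise-OR individual trace flags into a single value."""
    value = 0
    for flag in flags:
        value |= flag
    return value


def describe(value):
    """List the trace options selected by a combined value."""
    return [name for bit, name in sorted(CLUSTER_TRACE.items()) if value & bit]


value = combine(0o00001, 0o00010)
print(oct(value))        # 0o11
print(describe(value))
```

The same arithmetic applies to the trace flags of the other tools in this chapter, each with its own table of meanings.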





17.2 HBuild

17.2.1 Function 

This program is used to convert input files that represent language models in a number of different 
formats and output a standard HTK lattice. The main purpose of HBuild is to allow the expansion 
of HTK multi-level lattices and the conversion of bigram language models (such as those generated 
by HLStats) into lattice format. 

The specific input file types supported by HBuild are: 

1. HTK multi-level lattice files.

2. Back-off bigram files in ARPA/MIT-LL format.

3. Matrix bigram files produced by HLStats.

4. Word lists (to generate a word-loop grammar).

5. Word-pair grammars in ARPA Resource Management format.

The formats of both types of bigram supported by HBuild are described in Chapter 12. The
format for multi-level HTK lattice files is described in Chapter 20.

17.2.2 Use 

HBuild is invoked by the command line

HBuild [options] wordList outLatFile

The wordList should contain a list of all the words used in the input language model. The options
specify the type of input language model as well as the source filename. If none of the flags specifying
input language model type are given a simple word-loop is generated using the wordList given.
After processing the input language model, the resulting lattice is saved to file outLatFile.
The operation of HBuild is controlled by the following command line options

-b Output the lattice in binary format. This increases speed of subsequent loading (default
ASCII text lattices).

-m fn The matrix format bigram in fn forms the input language model.

-n fn The ARPA/MIT-LL format back-off bigram in fn forms the input language model.

-s st en Set the bigram entry and exit words to st and en (default !ENTER and !EXIT). Note
that no words will follow the exit word, or precede the entry word. Both the entry and exit
word must be included in the wordList. This option is only effective in conjunction with the
-n option.

-t st en This option is used with word-loops and word-pair grammars. An output lattice is
produced with an initial word-symbol st (before the loop) and a final word-symbol en (after
the loop). This allows initial and final silences to be specified. (Default is that the initial
and final nodes are labelled with !NULL.) Note that st and en shouldn't be included in the
wordList unless they occur elsewhere in the network. This is only effective for word-loop and
word-pair grammars.

-u s The unknown word is s (default !NULL). This option only has an effect when bigram input
language models are specified. It can be used in conjunction with the -z flag to delete the
symbol for unknown words from the output lattice.

-w fn The word-pair grammar in fn forms the input language model. The file must be in the
format used for the ARPA Resource Management grammar.

-x fn The extended HTK lattice in fn forms the input language model. This option is used to
expand a multi-level lattice into a single level lattice that can be processed by other HTK
tools.

-z Delete (zap) any references to the unknown word (see -u option) in the output lattice.

HBuild also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.

17.2.3 Tracing

HBuild supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting.

Trace flags are set using the -T option or the TRACE configuration variable.

17.3 HCompV

17.3.1 Function 

This program will calculate the global mean and covariance of a set of training data. It is primarily 
used to initialise the parameters of a HMM such that all component means and all covariances 
are set equal to the global data mean and covariance. This might form the first stage of a flat 
start training scheme where all models are initially given the same parameters. Alternatively, the 
covariances may be used as the basis for Fixed Variance and Grand Variance training schemes. 
These can sometimes be beneficial in adverse conditions where a fixed covariance matrix can give
increased robustness.

When training large model sets from limited data, setting a floor is often necessary to prevent
variances being badly underestimated through lack of data. One way of doing this is to define a
variance macro called varFloorN where N is the stream index. HCompV can also be used to create
these variance floor macros with values equal to a specified fraction of the global variance.

Another application of HCompV is the estimation of mean and variance vectors for use in
cluster-based mean and variance normalisation schemes. Given a list of utterances and a speaker
pattern HCompV will estimate a mean and a variance for each speaker.



17.3.2 Use 

HCompV is invoked via the command line

HCompV [options] [hmm] trainFiles ...

where hmm is the name of the physical HMM whose parameters are to be initialised. Note that
no HMM name needs to be specified when cepstral mean or variance vectors are estimated (-c
option). The effect of this command is to compute the covariance of the speech training data and
then copy it into every Gaussian component of the given HMM definition. If there are multiple
data streams, then a separate covariance is estimated for each stream. The HMM can have a mix of
diagonal and full covariances and an option exists to update the means also. The HMM definition
can be contained within one or more macro files loaded via the standard -H option. Otherwise, the
definition will be read from a file called hmm. Any tyings in the input definition will be preserved
in the output. By default, the new updated definition overwrites the existing one. However, a new
definition file including any macro files can be created by specifying an appropriate target directory
using the standard -M option.

In addition to the above, an option -f is provided to compute variance floor macros equal to a
specified fraction of the global variance. In this case, the newly created macros are written to a file
called vFloors. For each stream N defined for hmm, a variance macro called varFloorN is created. If
a target directory is specified using the standard -M option then the new file will be written there,
otherwise it is written in the current directory.
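The relationship between the global variance and the variance floor can be sketched for a single stream with diagonal covariance. The function names and the use of the biased (divide by the number of frames) variance estimate are assumptions for illustration only; the normalisation HCompV actually uses is not stated here.

```python
def global_mean_var(frames):
    """Per-dimension mean and variance computed over all training frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var


def variance_floor(var, fraction):
    """Floor vector equal to a fixed fraction of the global variance."""
    return [fraction * v for v in var]


frames = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]   # three 2-dimensional frames
mean, var = global_mean_var(frames)
floor = variance_floor(var, 0.01)                  # corresponds to -f 0.01
```

With several streams the same computation would be repeated per stream, giving one floor vector for each varFloorN macro.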

The list of train files can be stored in a script file if required. Furthermore, the data used for
estimating the global covariance can be limited to that corresponding to a specified label.

The calculation of cluster-based mean and variance estimates is enabled by the option -c which
specifies the output directory where the estimated vectors should be stored.

The detailed operation of HCompV is controlled by the following command line options 

-c s Calculate cluster-based mean/variance estimate and store results in the specified directory.

-k s Speaker pattern for cluster-based mean/variance estimation. Each utterance filename is
matched against the pattern and the characters that are matched against % are used as the
cluster name. One mean/variance vector is estimated for each cluster.

-p s Path pattern for cluster-based mean/variance estimation. Each utterance filename is matched
against the pattern and the characters that are matched against % are spliced to the end of
the directory string specified with option -c for the final mean/variance vectors output.

-q s For cluster-based mean/variance estimation different types of output can be requested. Any
subset of the letters nmv can be specified. Specifying n causes the number of frames in a
cluster to be written to the output file, m and v cause the mean and variance vectors to be
included, respectively.

-f f Create variance floor macros with values equal to f times the global variance. One macro is
created for each input stream and the output is stored in a file called vFloors.

-l s The string s must be the name of a segment label. When this option is used, HCompV
searches through all of the training files and uses only the speech frames from segments with
the given label. When this option is not used, HCompV uses all of the data in each training
file.

-m The covariances of the output HMM are always updated; however, updating the means must
be specifically requested. When this option is set, HCompV updates all the HMM component
means with the sample mean computed from the training files.

-o s The string s is used as the name of the output HMM in place of the source name.

-v f This sets the minimum variance (i.e. diagonal elements of the covariance matrix) to the real
value f (default value 0.0).

-B Output HMM definition files in binary format.

-F fmt Set the source data format to fmt.

-G fmt Set the label file format to fmt.

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-L dir Search directory dir for label files (default is to search current directory).

-M dir Store output HMM macro model files in the directory dir. If this option is not given, the
new HMM definition will overwrite the existing one.

-X ext Set label file extension to ext (default is lab).

HCompV also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.
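The extraction of a cluster name from a filename by the -k pattern can be sketched as follows. Only the role of the % character comes from the description above; treating ? as a single-character wildcard and * as a wildcard of any length, as well as the function names and the example path, are assumptions.

```python
import re


def mask_to_regex(pattern):
    """Translate a mask into a regular expression that captures each % character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append("(.)")      # matched character becomes part of the name
        elif ch == "?":
            parts.append(".")        # assumed: any single character
        elif ch == "*":
            parts.append(".*")       # assumed: any run of characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts) + r"\Z")


def cluster_name(pattern, filename):
    """Return the characters matched by %, or None if the filename does not match."""
    match = mask_to_regex(pattern).match(filename)
    return "".join(match.groups()) if match else None


name = cluster_name("*/%%%_*.mfc", "data/train/abc_001.mfc")
```

In this hypothetical layout every utterance whose filename starts with the same three characters would be pooled into one cluster.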
17.3.3 Tracing 

HCompV supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting.

00002 show covariance matrices.

00004 trace data loading.

00010 list label segments.

Trace flags are set using the -T option or the TRACE configuration variable.

17.4 HCopy

17.4.1 Function 

This program will copy one or more data files to a designated output file, optionally converting the 
data into a parameterised form. While the source files can be in any supported format, the output 
format is always HTK. By default, the whole of the source file is copied to the target but options 
exist to only copy a specified segment. Hence, this program is used to convert data files in other 
formats to the HTK format, to concatenate or segment data files, and to parameterise the result. 
If any option is set which leads to the extraction of a segment of the source file rather than all of
it, then segments will be extracted from all source files and concatenated to the target.

Labels will be copied/concatenated if any of the options indicating labels are specified (-i -l
-x -G -I -L -P -X). In this case, each source data file must have an associated label file, and
a target label file is created. The name of the target label file is the root name of the target
data file with the extension .lab, unless the -X option is used. This new label file will contain
the appropriately copied/truncated/concatenated labels to correspond with the target data file; all
start and end boundaries are recalculated if necessary.

When used in conjunction with HSLab, HCopy provides a facility for tasks such as cropping
silence surrounding recorded utterances. Since input files may be coerced, HCopy can also be used
to convert the parameter kind of a file, for example from WAVEFORM to MFCC, depending on
the configuration options. Not all possible conversions can actually be performed; see Table 17.1
for a list of valid conversions. Conversions must be specified via a configuration file as described
in chapter 5. Note also that the parameterisation qualifier _N cannot be used when saving files to
disk, and is meant only for on-the-fly parameterisation.



17.4.2 Use

HCopy is invoked by typing the command line

HCopy [options] sa1 [ + sa2 + ... ] ta [ sb1 [ + sb2 + ... ] tb ... ]

This causes the contents of the one or more source files sa1, sa2, ... to be concatenated and the
result copied to the given target file ta. To avoid the overhead of reinvoking the tool when processing
large databases, multiple sources and targets may be specified, for example

HCopy srcA.wav + srcB.wav tgtAB.wav srcC.wav tgtD.wav

will create two new files tgtAB.wav and tgtD.wav. HCopy takes file arguments from a script
specified using the -S option exactly as from the command line, except that any newlines are
ignored.
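The grouping of file arguments into copy jobs can be sketched as follows. The function name and the error handling are hypothetical; the sketch only mirrors the rule that sources joined by + are concatenated into the target that follows them.

```python
def parse_copy_args(args):
    """Group arguments into (sources, target) jobs; '+' joins consecutive sources."""
    jobs = []
    i = 0
    while i < len(args):
        sources = [args[i]]
        i += 1
        while i < len(args) and args[i] == "+":
            if i + 1 >= len(args):
                raise ValueError("'+' must be followed by a source file")
            sources.append(args[i + 1])
            i += 2
        if i >= len(args):
            raise ValueError("missing target for sources %r" % sources)
        jobs.append((sources, args[i]))
        i += 1
    return jobs


command = "srcA.wav + srcB.wav tgtAB.wav srcC.wav tgtD.wav"
jobs = parse_copy_args(command.split())

# A script file is treated the same way, with newlines acting as plain whitespace
script = "srcA.wav + srcB.wav tgtAB.wav\nsrcC.wav tgtD.wav\n"
script_jobs = parse_copy_args(script.split())
```

Both calls yield the same two jobs, one concatenating two sources and one copying a single source.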

The allowable options to HCopy are as follows where all times and durations are given in 100
ns units and are written as floating-point numbers.

-a i Use level i of associated label files with the -n and -x options. Note that this is not the same
as using the TRANSLEVEL configuration variable since the -a option still allows all levels to be
copied through to the output files.

-e f End copying from the source file at time f. The default is the end of the file. If f is negative
or zero, it is interpreted as a time relative to the end of the file, while a positive value indicates
an absolute time from the start of the file.

-i mlf Output label files to master file mlf.

-l s Output label files to the directory s. The default is to output to the current directory.

-m t Set a margin of duration t around the segments defined by the -n and -x options.

-n i [j] Extract the speech segment corresponding to the i'th label in the source file. If j is
specified, then the segment corresponding to the sequence of labels i to j is extracted. Labels
are numbered from their position in the label file. A negative index can be used to count from
the end of the label list. Thus, -n 1 -1 would specify the segment starting at the first label
and ending at the last.

-s f Start copying from the source file at time f. The default is 0.0, i.e. the beginning of the file.

-t n Set the line width to n chars when formatting trace output.

-x s [n] Extract the speech segment corresponding to the first occurrence of label s in the source
file. If n is specified, then the n'th occurrence is extracted. If multiple files are being
concatenated, segments are extracted from each file in turn, and the label must exist for each
concatenated file.

-F fmt Set the source data format to fmt.

-G fmt Set the label file format to fmt.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-L dir Search directory dir for label files (default is to search current directory).

-O fmt Set the target data format to fmt.

-P fmt Set the target label format to fmt.

-X ext Set label file extension to ext (default is lab).

HCopy also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.

Note that the parameter kind conversion mechanisms described in chapter 5 will be applied to
all source files. In particular, if an automatic conversion is requested via the configuration file,
then HCopy will copy or concatenate the converted source files, not the actual contents. Similarly,
automatic byte swapping may occur depending on the source format and the configuration variable
BYTEORDER. Because the sampling rate may change during conversions, the options that specify a
position within a file i.e. -s and -e use absolute times rather than sample index numbers. All times
in HTK are given in units of 100ns and are written as floating-point numbers. To save writing long
strings of zeros, standard exponential notation may be used, for example -s 1E6 indicates a start
time of 0.1 seconds from the beginning of the file.
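The time conventions can be made concrete with a few lines of arithmetic. The function names are hypothetical; the conversion factor follows from the 100ns unit, and the end-time rule is the one given for the -e option.

```python
HTK_UNITS_PER_SECOND = 1e7      # one HTK time unit is 100 ns


def to_seconds(htk_time):
    """Convert a time in 100 ns units to seconds."""
    return htk_time / HTK_UNITS_PER_SECOND


def to_htk(seconds):
    """Convert a time in seconds to 100 ns units."""
    return seconds * HTK_UNITS_PER_SECOND


def resolve_end(end, file_duration):
    """Interpret an -e value: zero or negative is relative to the end of the file."""
    return file_duration + end if end <= 0 else end


start_seconds = to_seconds(1e6)           # the -s 1E6 example above
end_units = resolve_end(-5e6, 3e7)        # 0.5 s before the end of a 3 s file
```

With these definitions -s 1E6 corresponds to 0.1 seconds, and an end time of -5E6 on a three second file resolves to 2.5 seconds from the start.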

Table 17.1 Valid Parameter Conversions

[Table 17.1: matrix of valid conversions between the parameter kinds WAVEFORM, LPC, LPREFC,
LPCEPSTRA, IREFC, MFCC, FBANK, MELSPEC, USER, DISCRETE and PLP; the individual table
entries are not recoverable here.]



Note that truncations are performed after any desired coding, which may result in a loss of time 
resolution if the target file format has a lower sampling rate. Also, because of windowing effects, 
truncation, coding, and concatenation operations are not necessarily interchangeable. If in doubt, 
perform all truncation/concatenation in the waveform domain and then perform parameterisation 
as a last, separate invocation of HCopy.

17.4.3 Trace Output

HCopy supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting.

00002 source and target file formats and parameter kinds.

00004 segment boundaries computed from label files.

00010 display memory usage after processing each file.

Trace flags are set using the -T option or the TRACE configuration variable.

17.5 HDMan

17.5.1 Function 

The HTK tool HDMan is used to prepare a pronunciation dictionary from one or more sources. It 
reads in a list of editing commands from a script file and then outputs an edited and merged copy 
of one or more dictionaries. 

Each source pronunciation dictionary consists of comment lines and definition lines. Comment
lines start with the # character (or optionally any one of a set of specified comment chars) and
are ignored by HDMan. Each definition line starts with a word and is followed by a sequence of
symbols (phones) that define the pronunciation. The words and the phones are delimited by spaces
or tabs, and the end of line delimits each definition.

Dictionaries used by HDMan are read using the standard HTK string conventions (see
section 4.6); however, the command IR can be used in a HDMan source edit script to switch to
a raw format. Note that in the default mode, words and phones should not begin with
unmatched quotes (they should be escaped with the backslash). All dictionary entries must already
be alphabetically sorted before using HDMan.

Each edit command in the script file must be on a separate line. Lines in the script file starting
with a # are comment lines and are ignored. The commands supported are listed below. They can
be displayed by HDMan using the -Q option.

When no edit files are specified, HDMan simply merges all of the input dictionaries and outputs
them in sorted order. All input dictionaries must be sorted. Each input dictionary xxx may be
processed by its own private set of edit commands stored in xxx.ded. Subsequent to the processing
of the input dictionaries by their own unique edit scripts, the merged dictionary can be processed
by commands in global.ded (or some other specified global edit file name).

Dictionaries are processed on a word by word basis in the order that they appear on the command
line. Thus, all of the pronunciations for a given word are loaded into a buffer, then all edit commands
are applied to these pronunciations. The result is then output and the next word loaded.

Where two or more dictionaries give pronunciations for the same word, the default behaviour
is that only the first set of pronunciations encountered are retained and all others are ignored. An
option exists to override this so that all pronunciations are concatenated.

Dictionary entries can be filtered by a word list such that all entries not in the list are
ignored. Note that the word identifiers in the word list should match exactly (e.g. same case) their
corresponding entries in the dictionary.
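The merging and filtering rules can be sketched with in-memory dictionaries. This is an illustration only: it does not reproduce the word-by-word buffering described above, and the function name, parameter names and data layout (word mapped to a list of phone tuples) are assumptions.

```python
def merge_dictionaries(dicts, merge_all=False, word_list=None):
    """Merge dictionaries given in command-line order and return sorted entries."""
    merged = {}
    for source in dicts:
        for word, prons in source.items():
            if word_list is not None and word not in word_list:
                continue                       # exact, case-sensitive filter
            if word not in merged:
                merged[word] = list(prons)     # first dictionary wins by default
            elif merge_all:
                for pron in prons:             # keep every distinct pronunciation
                    if pron not in merged[word]:
                        merged[word].append(pron)
    return dict(sorted(merged.items()))


d1 = {"a": [("ax",)], "the": [("dh", "ax")]}
d2 = {"the": [("dh", "iy"), ("dh", "ax")], "zoo": [("z", "uw")]}

default = merge_dictionaries([d1, d2])
everything = merge_dictionaries([d1, d2], merge_all=True)
filtered = merge_dictionaries([d1, d2], word_list={"the"})
```

By default the word "the" keeps only the pronunciation from the first dictionary, whereas with merging enabled the second, distinct pronunciation is appended.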

The edit commands provided by HDMan are as follows

AS A B ...  Append silence models A, B, etc to each pronunciation.

CR X A Y B  Replace phone Y in the context of A_B by X. Contexts may include an asterisk * to
denote any phone or a defined context set defined using the DC command.

DC X A B ...  Define the set A, B, ... as the context X.

DD X A B ...  Delete the definition for word X starting with phones A, B, ....

DP A B C ...  Delete any occurrences of phones A or B or C ....

DS src  Delete each pronunciation from source src unless it is the only one for the current
word.

DW X Y Z ...  Delete words (& definitions) X, Y, Z, ....

FW X Y Z ...  Define X, Y, Z, ... as function words and change each phone in the definition to a
function word specific phone. For example, in word W phone A would become W.A.

IR  Set the input mode to raw. In raw mode, words are regarded as arbitrary sequences
of printing chars. In the default mode, words are strings as defined in section 4.6.

LC [X]  Convert all phones to be left-context dependent. If X is given then the 1st phone a
in each word is changed to X-a otherwise it is unchanged.

LP  Convert all phones to lowercase.

LW  Convert all words to lowercase.

MP X A B ...  Merge any sequence of phones A B ... and rename as X.

RC [X]  Convert all phones to be right-context dependent. If X is given then the last phone z
in each word is changed to z+X otherwise it is unchanged.

RP X A B ...  Replace all occurrences of phones A or B ... by X.

RS system  Remove stress marking. Currently the only stress marking system supported is that
used in the dictionaries produced by Carnegie Mellon University (system = cmu).

RW X A B ...  Replace all occurrences of word A or B ... by X.

SP X A B ...  Split phone X into the sequence A B C ....

TC [X [Y]]  Convert phones to triphones. If X is given then the first phone a is converted to X-a+b
otherwise it is unchanged. If Y is given then the last phone z is converted to y-z+Y otherwise
if X is given then it is changed to y-z+X otherwise it is unchanged.

UP  Convert all phones to uppercase.

UW  Convert all words to uppercase.
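The TC rule can be sketched as follows, following the wording of the command description literally. The function name is hypothetical, and leaving pronunciations of fewer than two phones untouched is an assumption, since that case is not covered above.

```python
def to_triphones(phones, x=None, y=None):
    """Convert one pronunciation to triphones as described for TC [X [Y]]."""
    if len(phones) < 2:
        return list(phones)                 # assumption: too short to convert
    last = len(phones) - 1
    out = []
    for i, p in enumerate(phones):
        if i == 0:
            # first phone a becomes X-a+b only when X is given
            out.append(f"{x}-{p}+{phones[1]}" if x is not None else p)
        elif i == last:
            # last phone z becomes y-z+Y, else y-z+X, else stays unchanged
            boundary = y if y is not None else x
            out.append(f"{phones[i - 1]}-{p}+{boundary}" if boundary is not None else p)
        else:
            out.append(f"{phones[i - 1]}-{p}+{phones[i + 1]}")
    return out


plain = to_triphones(["k", "ae", "t"])
with_x = to_triphones(["k", "ae", "t"], "sil")
with_xy = to_triphones(["k", "ae", "t"], "sil", "sp")
```

The LC and RC commands follow the same pattern with only the left or only the right context attached.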

17.5.2 Use

HDMan is invoked by typing the command line

HDMan [options] newDict srcDict1 srcDict2 ...

This causes HDMan to read in the source dictionaries srcDict1, srcDict2, etc. and generate a new
dictionary newDict. The available options are



-a s Each character in the string s denotes the start of a comment line. By default there is just
one comment character defined which is #.

-b s Define s to be a word boundary symbol.

-e dir Look for edit scripts in the directory dir.

-g f File f holds the global edit script. By default, HDMan expects the global edit script to be
called global.ded.

-h i j Skip the first i lines of the j'th listed source dictionary.

-i Include word output symbols in the output dictionary.

-j Include pronunciation probabilities in the output dictionary.

-l s Write a log file to s. The log file will include dictionary statistics and a list of the number of
occurrences of each phone.

-m Merge pronunciations from all source dictionaries. By default, HDMan generates a single
pronunciation for each word. If several input dictionaries have pronunciations for a word,
then the first encountered is used. Setting this option causes all distinct pronunciations to be
output for each word.

-n f Output a list of all distinct phones encountered to file f.

-o Disable dictionary output.

-p f Load the phone list stored in file f. This enables a check to be made that all output phones
are in the supplied list. You need to create a log file (-l) to view the results of this check.

-t Tag output words with the name of the source dictionary which provided the pronunciation. 

-w f Load the word list stored in file f . Only pronunciations for the words in this list will be 
extracted from the source dictionaries. 

-Q Print a summary of all commands supported by this tool. 

HDMan also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
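
As a rough illustration of how the options combine, the following Python sketch assembles an HDMan argument list from a few of the options described above. The helper name hdman_args and all file names are hypothetical; the sketch only builds the list and does not run the tool.

```python
def hdman_args(new_dict, src_dicts, word_list=None, phone_list_out=None,
               log_file=None, merge=False):
    """Assemble an HDMan argument list using only options described above."""
    args = ["HDMan"]
    if merge:
        args.append("-m")                    # output all distinct pronunciations
    if word_list:
        args += ["-w", word_list]            # restrict output to these words
    if phone_list_out:
        args += ["-n", phone_list_out]       # write the list of distinct phones
    if log_file:
        args += ["-l", log_file]             # dictionary statistics and phone counts
    return args + [new_dict] + list(src_dicts)
```

The resulting list could be handed to a process launcher such as subprocess.run once the dictionaries exist.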
17.5.3 Tracing



HDMan supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting
00002 word buffer operations
00004 show valid inputs
00010 word level editing
00020 word level editing in detail
00040 print edit scripts
00100 new phone recording
00200 pron deletions
00400 word deletions

Trace flags are set using the -T option or the TRACE configuration variable.
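
Since each trace flag is a single bit written in octal, several flags can be enabled at once by OR-ing them together. The Python sketch below shows the arithmetic; the constant names are illustrative, and the five-digit zero-padded output format is an assumption chosen to match the table.

```python
BASIC_PROGRESS = 0o00001
WORD_LEVEL_EDITING = 0o00010
PRINT_EDIT_SCRIPTS = 0o00040

def trace_value(*flags):
    """OR the selected trace flags together and format the result in octal."""
    combined = 0
    for flag in flags:
        combined |= flag
    return format(combined, "05o")
```

Combining the three flags above yields the octal value 00051.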
17.6 HDecode

WARNING: In contrast to the rest of HTK, HDecode has been specifically 
written for speech recognition. Known restrictions are: 

• only works for cross-word triphones; 

• sil and sp models are reserved as silence models and are, by default, 
automatically added to the end of all "words" in the pronunciation 
dictionary. 

• lattices generated with HDecode must be merged using HLRescore to
remove duplicate word paths prior to being used for lattice rescoring
with HDecode and HVite.

For an example of the use of HDecode see the Resource Management
recipe, section 11, in the samples tar-ball that is available for download.

17.6.1 Function

HDecode is a large vocabulary word recogniser. Similar to HVite, it transcribes speech files
using a HMM model set and a dictionary (vocabulary). The best transcription hypothesis will be
generated in the Master Label File (MLF) format. Optionally, multiple hypotheses can also be
generated as a word lattice in the form of the HTK Standard Lattice Format (SLF).

The search space of the recognition process is defined by a model based network, produced from
expanding a supplied language model or a word level lattice using the dictionary. In the absence of
a word lattice, a language model must be supplied to perform a full decoding. The current version
of HDecode only supports bigram full decoding. When a word lattice is supplied, the use of a
language model is optional. This mode of operation is known as lattice rescoring. The acoustic and
language model scores can be adjusted using the -a and -s options respectively. In the case where
the supplied dictionary contains pronunciation probability information, the corresponding scale
factor can be adjusted using the -r option. Use the -q option to control the type of information to
be included in the generated lattices.

HDecode, when compiled with the MODALIGN compile directive, can also be used to align the
HMM models to a given word level lattice (also known as model marking the lattice). When
using the default Makefile supplied with HDecode, this binary will be made and stored in
HDecode.mod.

HDecode supports shared parameters and appropriately pre-computes output probabilities.
The runtime of the decoding process can be adjusted by changing the pruning beam width (see
the -t option), word end beam width (see the -v option) and the maximum model pruning (see
the -u option). HDecode also allows probability calculation to be carried out in blocks at the
same time. The block size (in frames) can be specified using the -k option. However, when
CMLLR adaptation is used, probabilities have to be calculated one frame at a time (i.e. using -k
1)¹. Speaker adaptation is supported by HDecode only in terms of using a speaker specific linear
adaptation transformation. The use of an adaptation transformation is enabled using the -m option.
The path, name and extension of the transformation matrices are specified using the -J option and the
file names are derived from the name of the speech file using a mask (see the -h option). Online
(batch or incremental) adaptations are not supported by HDecode.

Note that for lattice rescoring, word lattices must be deterministic. Duplicated paths and
pronunciation variants are not permitted. See the HLRescore reference page for information on how
to produce deterministic lattices.

17.6.2 Use 

HDecode is invoked via the command line 



HDecode [options] dictFile hmmList testFiles . . . 

¹ This is due to the different caching mechanism used in HDecode and the HAdapt module.
HDecode will then either load an N-gram language model file (-w s) and create a decoding network
for the test files, which is the full decoding mode, or create a new network for each test file from 
the corresponding word lattice, which is the lattice rescoring mode. When a new network is created 
for each test file the path name of the label (or lattice) file to load is determined from the test file 
name and the -L and -X options described below. 

The hmmList should contain a list of the models required to construct the network from the 
word level representation. 

The recogniser output is written in the form of a label file whose path name is determined from
the test file name and the -l and -y options described below. The list of test files can be stored in
a script file if required. When performing lattice recognition (see -z s option described below) the
output lattice file contains multiple alternatives and the format is determined by the -q option.
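
The way an input lattice or output label path is derived from a test file name can be sketched as follows. This is a minimal Python illustration, assuming that the directory option replaces the test file's directory and the extension option replaces its extension; the function name derived_path and the file names are hypothetical.

```python
from pathlib import PurePosixPath

def derived_path(test_file, directory=None, ext="rec"):
    """Swap the extension of test_file and optionally relocate it to directory."""
    p = PurePosixPath(test_file)
    name = f"{p.stem}.{ext}"
    # Without a directory option the file stays alongside the data.
    parent = PurePosixPath(directory) if directory is not None else p.parent
    return str(parent / name)
```

Under these assumptions, data/utt1.mfc maps to data/utt1.rec for output labels, and to lats/utt1.lat when the lattice directory is lats and the extension is lat.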

The detailed operation of HDecode is controlled by the following command line options

-a f Set acoustic scale factor to f. This factor post-multiplies the acoustic likelihoods from the
word lattices (default value 1.0).

-d dir This specifies the directory to search for the HMM definition files corresponding to the
labels used in the recognition network.

-h mask Set the mask for determining which transform names are to be used for the input
transforms.

-i s Output transcriptions to MLF s.

-k i Set frame block size in output probability calculation for diagonal covariance systems.

-l dir This specifies the directory to store the output label files. If this option is not used then
HDecode will store the label files in the same directory as the data. When output is directed
to an MLF, this option can be used to add a path to each output file name. In particular,
setting the option -l '*' will cause a label file named xxx to be prefixed by the pattern
"*/xxx" in the output MLF file. This is useful for generating MLFs which are independent
of the location of the corresponding data files.

-m Use an input transform (default is off).

-n i Use i tokens in each state to perform lattice recognition (default is 32 tokens per state).

-o s Choose how the output labels should be formatted. s is a string with certain letters (from
NSCTWMX) indicating binary flags that control formatting options. N normalise acoustic scores
by dividing by the duration (in frames) of the segment. S Remove scores from output label.
By default scores will be set to the total likelihood of the segment. C Set the transcription
labels to start and end on frame centres. By default start times are set to the start time
of the frame and end times are set to the end time of the frame. T Do not include times
in output label files. W Do not include words in output label files when performing state or
model alignment. M Do not include model names in output label files when performing state
and model alignment. X Strip the triphone context.

-p f Set the word insertion log probability to f (default 0.0).

-q s Choose how the output lattice should be formatted. s is a string with certain letters (from
ABtvaldmnr) indicating binary flags that control formatting options. A attach word labels to
arcs rather than nodes. B output lattices in binary for speed. t output node times. v output
pronunciation information. a output acoustic likelihoods. l output language model likelihoods.
d output word alignments (if available). m output within word alignment durations.
n output within word alignment likelihoods. r output pronunciation probabilities.

-r f Set the dictionary pronunciation probability scale factor to f (default value 1.0).

-s f Set the grammar scale factor to f. This factor post-multiplies the language model likelihoods
from the word lattices (default value 1.0).

-t f [g] Enable beam searching such that any model whose maximum log probability token falls
more than the main beam f below the maximum for all models is deactivated. An extra
parameter g can be specified as the relative beam width. It may override the main beam
width.
-u i Set the maximum number of active models to i. Setting i to 0 disables this limit (default 0).

-v f [g] Enable word end pruning. Do not propagate tokens from word end nodes that fall more
than f below the maximum word end likelihood (default 0.0). An extra parameter g can be
specified.

-w s Load language model from s.

-x ext This sets the extension to use for HMM definition files to ext.

-y ext This sets the extension for output label files to ext (default rec).

-z ext Enable output of lattices with extension ext (default off).

-L dir This specifies the directory to find input lattices.

-X ext Set the extension for the input lattice files to be ext (default value lat).

-E dir [ext] Parent transform directory and optional extension for parent transforms. The
default option is that no parent transform is used.

-F fmt Set the source data format to fmt.

-G fmt Set the label file format to fmt.

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs.

-J dir [ext] Add directory to the list of possible input transform directories. Only one of the
options can specify the extension to use for the input transforms.

-K dir [ext] Output transform directory and optional extension for output transforms. The
default option is that there is no output extension and the current transform directory is used.

-P fmt Set the target label format to fmt.

HDecode also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.
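
The effect of the main beam (-t) and the maximum model limit (-u) can be pictured with the following Python sketch. It is a simplified illustration, not HDecode's implementation: the function name is hypothetical, and keeping the best-scoring models when the limit is exceeded is an assumption, since the text above does not say how the limit is enforced.

```python
def active_after_pruning(model_scores, beam, max_models=0):
    """Return the names of models that stay active.

    model_scores maps a model name to its best token log probability.
    """
    best = max(model_scores.values())
    # Beam pruning: drop models more than `beam` below the best model.
    kept = [m for m, s in model_scores.items() if s >= best - beam]
    kept.sort(key=lambda m: model_scores[m], reverse=True)
    if max_models > 0:
        kept = kept[:max_models]      # 0 means no limit, as for -u
    return kept
```

A narrower beam or a smaller limit reduces the number of active models and hence the run time, at the risk of pruning the correct hypothesis.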
17.6.3 Tracing 

HDecode supports the following trace options where each trace flag is given using an octal base

0001 enable basic progress reporting.
0002 list observations.
0004 show adaptation process.
0010 show memory usage at start and finish.

Trace flags are set using the -T option or the TRACE configuration variable.
17.7 HERest

17.7.1 Function 

This program is used to perform a single re-estimation of the parameters of a set of HMMs, or
linear transforms, using an embedded training version of the Baum-Welch algorithm. Training data
consists of one or more utterances each of which has a transcription in the form of a standard label
file (segment boundaries are ignored). For each training utterance, a composite model is effectively
synthesised by concatenating the phoneme models given by the transcription. Each phone model
has the same set of accumulators allocated to it as are used in HRest but in HERest they are
updated simultaneously by performing a standard Baum-Welch pass over each training utterance
using the composite model.

HERest is intended to operate on HMMs with initial parameter values estimated by HInit/HRest.
HERest supports multiple mixture Gaussians, discrete and tied-mixture HMMs, multiple data
streams, parameter tying within and between models, and full or diagonal covariance matrices.
HERest also supports tee-models (see section 7.8), for handling optional silence and non-speech
sounds. These may be placed between the units (typically words or phones) listed in the
transcriptions but they cannot be used at the start or end of a transcription. Furthermore, chains of
tee-models are not permitted.

HERest includes features to allow parallel operation where a network of processors is available.
When the training set is large, it can be split into separate chunks that are processed in parallel on
multiple machines/processors, consequently speeding up the training process.

Like all re-estimation tools, HERest allows a floor to be set on each individual variance by
defining a variance floor macro for each data stream (see chapter 8). The configuration variable
VARFLOORPERCENTILE allows the same thing to be done in a different way which appears to improve
recognition results. By setting this to e.g. 20, the variances from each dimension are floored to the
20th percentile of the distribution of variances for that dimension.
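
The percentile-based floor can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, and the nearest-rank percentile used here is an assumption, since the exact rule HERest applies is not given in this section.

```python
def percentile_floors(variances, percentile=20):
    """variances: one list per Gaussian, one entry per dimension."""
    dims = len(variances[0])
    floors = []
    for d in range(dims):
        column = sorted(v[d] for v in variances)
        # Nearest-rank percentile using integer arithmetic (assumed rule).
        rank = max(1, -(-percentile * len(column) // 100))
        floors.append(column[rank - 1])
    return floors

def apply_floors(variances, floors):
    """Raise every variance that lies below the floor for its dimension."""
    return [[max(x, f) for x, f in zip(v, floors)] for v in variances]
```

Each dimension gets its own floor, so dimensions with naturally small variances are not over-floored by a single global value.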

HERest supports two specific methods for initialisation of model parameters, single pass
retraining and 2-model re-estimation.

Single pass retraining is useful when the parameterisation of the front-end (e.g. from MFCC to
PLP coefficients) is to be modified. Given a set of well-trained models, a set of new models using
a different parameterisation of the training data can be generated in a single pass. This is done
by computing the forward and backward probabilities using the original well-trained models and
the original training data, but then switching to a new set of training data to compute the new
parameter estimates.
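
The idea behind single pass retraining can be illustrated with a heavily simplified Python sketch. It assumes one-dimensional, unit-variance Gaussians with equal priors and uses frame-level posteriors in place of a full forward-backward pass, so it shows only the principle (occupancies from the old parameterisation, statistics from the new one) and not HERest's algorithm. All names are hypothetical.

```python
import math

def posteriors(x, means, var=1.0):
    """Occupation probabilities of each Gaussian for observation x (equal priors)."""
    likes = [math.exp(-(x - m) ** 2 / (2 * var)) for m in means]
    total = sum(likes)
    return [like / total for like in likes]

def single_pass_retrain(old_means, old_obs, new_obs):
    """Align with the old models on old_obs, re-estimate means from new_obs."""
    num = [0.0] * len(old_means)
    den = [0.0] * len(old_means)
    for x_old, x_new in zip(old_obs, new_obs):
        for j, g in enumerate(posteriors(x_old, old_means)):
            num[j] += g * x_new       # statistics use the new parameterisation
            den[j] += g               # occupancy comes from the old one
    return [n / d for n, d in zip(num, den)]
```

The frames of old_obs and new_obs must correspond one to one, which mirrors the pair-by-pair processing of training files in this mode.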

In 2-model re-estimation one model set can be used to obtain the forward backward probabilities
which then are used to update the parameters of another model set. Contrary to single pass
retraining the two model sets are not required to be tied in the same fashion. This is particularly
useful for training of single mixture models prior to decision-tree based state clustering. The use
of 2-model re-estimation in HERest is triggered by setting the config variables ALIGNMODELMMF or
ALIGNMODELDIR and ALIGNMODELEXT together with ALIGNHMMLIST (see section 8.7). As the model
list can differ for the alignment model set a separate set of input transforms may be specified using
the ALIGNXFORMDIR and ALIGNXFORMEXT.

HERest for updating model parameters operates in two distinct stages.

1. In the first stage, one of the following two options applies

(a) Each input data file contains training data which is processed and the accumulators for
state occupation, state transition, means and variances are updated.

(b) Each data file contains a dump of the accumulators produced by previous runs of the
program. These are read in and added together to form a single set of accumulators.

2. In the second stage, one of the following options applies

(a) The accumulators are used to calculate new estimates for the HMM parameters.

(b) The accumulators are dumped into a file.

Thus, on a single processor the default combination 1(a) and 2(a) would be used. However, if 
N processors are available then the training data would be split into N equal groups and HERest 
would be set to process one data set on each processor using the combination 1(a) and 2(b). When 
all processors had finished, the program would then be run again using the combination 1(b) and 
2(a) to load in the partial accumulators created by the N processors and do the final parameter
re-estimation. The choice of which combination of operations HERest will perform is governed by 
the -p option switch as described below. 
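
The reason this split works is that the accumulators are plain sums, so partial results can be added in any order. The Python sketch below illustrates this for a single one-dimensional Gaussian with every frame fully assigned to it; this is a deliberate simplification of the real accumulators, and the function names are hypothetical.

```python
def accumulate(observations):
    """Stage 1(a): reduce a chunk of data to (count, sum, sum of squares)."""
    return (len(observations),
            sum(observations),
            sum(x * x for x in observations))

def merge(accs):
    """Stage 1(b): add partial accumulators together."""
    return tuple(sum(parts) for parts in zip(*accs))

def estimate(acc):
    """Stage 2(a): turn accumulators into a mean and variance."""
    n, s, ss = acc
    mean = s / n
    return mean, ss / n - mean * mean
```

Merging the accumulators of the separate chunks gives the same totals as one pass over all the data, so the final estimates are unaffected by the parallel split.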

As a further performance optimisation, HERest will also prune the α and β matrices. By this
means, a factor of 3 to 5 speed improvement and a similar reduction in memory requirements can
be achieved with negligible effects on training performance (see the -t option below).
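
The pruning of β values at one time frame can be sketched as follows. This Python fragment only illustrates the thresholding rule described for the -t option; the function name and the dictionary representation are assumptions.

```python
def prune_log_beta(log_beta_t, f):
    """Keep only entries within f of the maximum at one time frame.

    log_beta_t maps state index to log beta; f <= 0 disables pruning.
    """
    if f <= 0.0:
        return dict(log_beta_t)
    best = max(log_beta_t.values())
    return {s: b for s, b in log_beta_t.items() if b >= best - f}
```

States removed here need no α computation in the subsequent forward pass, which is where the speed and memory savings come from.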

HERest is able to make use of, and estimate, linear transformations for model adaptation.
There are three types of linear transform that are made use of in HERest.

• Input transform: the input transform is used to determine the forward-backward probabilities,
hence the component posteriors, for estimating model and transform.

• Output transform: the output transform is generated when the -u option is set to a. The
transform will be stored in the current directory, or the directory specified by the -K option
and optionally the transform extension.

• Parent transform: the parent transform determines the model, or features, on which the
model set or transform is to be generated. For transform estimation this allows cascades of
transforms to be used to adapt the model parameters. For model estimation this supports
speaker adaptive training. Note the current implementation only supports adaptive training
with CMLLR. Any parent transform can be used when generating transforms.



When input or parent transforms are specified the transforms may physically be stored in multiple
directories. Which transform is to be used is determined by the following search order:

1. Any loaded macro that matches the transform (and its extension) name.

2. If it is a parent transform, the directory specified with the -E option.

3. The list of directories specified with the -J option. The directories are searched in the order
that they are specified in the command line.
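
The search order above can be sketched in Python as follows. The function name find_transform is hypothetical, the set existing_files stands in for the file system, and the name.ext file naming is an assumption made for illustration.

```python
def find_transform(name, ext, loaded_macros, existing_files,
                   parent_dir=None, input_dirs=(), is_parent=False):
    """Return where a transform would be found, or None.

    existing_files is a set of paths standing in for the file system.
    """
    full = f"{name}.{ext}" if ext else name
    if full in loaded_macros:                      # 1. loaded macro
        return f"macro:{full}"
    if is_parent and parent_dir is not None:       # 2. -E directory (parent only)
        candidate = f"{parent_dir}/{full}"
        if candidate in existing_files:
            return candidate
    for d in input_dirs:                           # 3. -J directories, in order
        candidate = f"{d}/{full}"
        if candidate in existing_files:
            return candidate
    return None
```

Because a loaded macro always wins, two transform sets that share a name and extension would shadow one another regardless of the directories given.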

As the search order above looks for loaded macros first it is recommended that unique extensions
are specified for each set of transforms generated. Transforms may also be stored in a single TMF.
These TMFs may be loaded using the -H option. When macros are specified for the regression class
trees and the base classes the following search order is used

1. Any loaded macro that matches the macro name.

2. The path specified by the configuration variable.

3. The list of directories specified with the -J option. The directories are searched in the order
that they are specified in the command line.

Baseclasses and regression classes may also be loaded using the -H option.

HERest can also estimate semi-tied transformations by specifying the s update option with
the -u flag. This uses the same baseclass specification as the linear transformation adaptation code
to allow multiple transformations to be estimated. The specification of the baseclasses is identical
to that used for linear adaptation. Updating semi-tied transforms always updates the means and
diagonal covariance matrices as well. Full covariance matrices are not supported. When using this
form of estimation, full covariance statistics are accumulated. This makes the memory requirements
large compared to estimating diagonal covariance matrices.



17.7.2 Use 

HERest is invoked via the command line 

HERest [options] hmmList trainFile ...

This causes the set of HMMs given in hmmList to be loaded. The given list of training files is then 
used to perform one re-estimation cycle. As always, the list of training files can be stored in a script 
file if required. On completion, HERest outputs new updated versions of each HMM definition. If 
the number of training examples falls below a specified threshold for some particular HMM, then 
the new parameters for that HMM are ignored and the original parameters are used instead. 
The detailed operation of HERest is controlled by the following command line options 
-a Use an input transform to obtain alignments for updating models or transforms (default off).

-c f Set the threshold for tied-mixture observation pruning to f. For tied-mixture TIEDHS systems, 
only those mixture component probabilities which fall within f of the maximum mixture 
component probability are used in calculating the state output probabilities (default 10.0). 

-d dir Normally HERest looks for HMM definitions (not already loaded via MMF files) in the 
current directory. This option tells HERest to look in the directory dir to find them. 

-h mask Set the mask for determining which transform names are to be used for the output
transforms. If PAXFORMMASK or INXFORMMASK are not specified then the input transform mask is
assumed for both output and parent transforms.

-l N Set the maximum number of files to use for each speaker, determined by the output transform
speaker mask, to estimate the transform with (default ∞).

-m N Set the minimum number of training examples required for any model to N. If the actual
number falls below this value, the HMM is not updated and the original parameters are used
for the new version (default value 3).

-o ext This causes the file name extensions of the original models (if any) to be replaced by ext.

-p N This switch is used to set parallel mode operation. If p is set to a positive integer N, then
HERest will process the training files and then dump all the accumulators into a file called
HERN.acc. If p is set to 0, then it treats all file names input on the command line as the
names of .acc dump files. It reads them all in, adds together all the partial accumulations
and then re-estimates all the HMM parameters in the normal way.



-r This enables single-pass retraining. The list of training files is processed pair-by-pair. For each
pair, the first file should match the parameterisation of the original model set. The second
file should match the parameterisation of the required new set. All speech input processing is
controlled by configuration variables in the normal way except that the variables describing
the old parameterisation are qualified by the name HPARM1 and the variables describing the
new parameterisation are qualified by the name HPARM2. The stream widths for the old and
the new must be identical.

-s file This causes statistics on occupation of each state to be output to the named file. This
file is needed for the RO command of HHEd but it is also generally useful for assessing the
amount of training material available for each HMM state.

-t f [i l] Set the pruning threshold to f. During the backward probability calculation, at each
time t all (log) β values falling more than f below the maximum β value at that time are
ignored. During the subsequent forward pass, (log) α values are only calculated if there are
corresponding valid β values. Furthermore, if the ratio of the αβ product divided by the total
probability (as computed on the backward pass) falls below a fixed threshold then those values
of α and β are ignored. Setting f to zero disables pruning (default value 0.0). Tight pruning
thresholds can result in HERest failing to process an utterance. If the i and l options
are given, then a pruning error results in the threshold being increased by i and utterance
processing restarts. If errors continue, this procedure will be repeated until the limit l is
reached.

-u flags By default, HERest updates all of the HMM parameters, that is, means, variances,
mixture weights and transition probabilities. This option causes just the parameters indicated
by the flags argument to be updated. This argument is a string containing one or more of the
letters m (mean), v (variance), t (transition), a (linear transform), p (use MAP adaptation), s
(semi-tied transform), and w (mixture weight). The presence of a letter enables the updating
of the corresponding parameter set.

-v f This sets the minimum variance (i.e. diagonal element of the covariance matrix) to the real
value f (default value 0.0).

-w f Any mixture weight which falls below the global constant MINMIX is treated as being zero. 
When this parameter is set, all mixture weights are floored to f * MINMIX. 
-L dir Search directory dir for label files (default is to search current directory).



-x ext By default, HERest expects a HMM definition for the label X to be stored in a file called
X. This option causes HERest to look for the HMM definition in the file X.ext.

-z file Save all output transforms to file. Default is TMF. 

-B Output HMM definition files in binary format. 

-E dir [ext] Parent transform directory and optional extension for parent transforms. The
default option is that no parent transform is used.

-F fmt Set the source data format to fmt.

-G fmt Set the label file format to fmt.

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-J dir [ext] Add directory to the list of possible input transform directories. Only one of the
options can specify the extension to use for the input transforms.

-K dir [ext] Output transform directory and optional extension for output transforms. The
default option is that there is no output extension and the current transform directory is used.

-M dir Store output HMM macro model files in the directory dir. If this option is not given, the
new HMM definition will overwrite the existing one.

-X ext Set label file extension to ext (default is lab).

HERest also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.
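
The flags string accepted by the -u option described above is a set of single-letter switches. A small Python sketch of how such a string maps to parameter sets is given below; the table mirrors the letters listed for -u, while the function name and the error handling are assumptions.

```python
UPDATE_FLAGS = {
    "m": "mean", "v": "variance", "t": "transition",
    "a": "linear transform", "p": "MAP adaptation",
    "s": "semi-tied transform", "w": "mixture weight",
}

def parse_update_flags(flags):
    """Return the set of parameter kinds enabled by a -u flags string."""
    unknown = sorted(set(flags) - set(UPDATE_FLAGS))
    if unknown:
        raise ValueError(f"unknown update flags: {''.join(unknown)}")
    return {UPDATE_FLAGS[c] for c in flags}
```

For instance, the string mv enables updating of means and variances only.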
17.7.3 Tracing

HERest supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting.
00002 show the logical/physical HMM map.
00004 list the updated model parameters, of tied mixture components.

Trace flags are set using the -T option or the TRACE configuration variable.
17.8 HHEd

17.8.1 Function 

HHEd is a script driven editor for manipulating sets of HMM definitions. Its basic operation is to 
load in a set of HMMs, apply a sequence of edit operations and then output the transformed set. 
HHEd is mainly used for applying tyings across selected HMM parameters. It also has facilities 
for cloning HMMs, clustering states and editing HMM structures. 

Many HHEd commands operate on sets of similar items selected from the set of currently loaded
HMMs. For example, it is possible to define a set of all final states of all vowel models, or all mean
vectors of all mixture components within the model X, etc. Sets such as these are defined by item
lists using the syntax rules given below. In all commands, all of the items in the set defined by an
item list must be of the same type where the possible types are

s - state                 t - transition matrix
p - pdf                   w - stream weights
m - mixture component     d - duration parameters
u - mean vector           x - transform matrix
v - variance vector       i - inverse covariance matrix
h - HMM definition

Most of the above correspond directly to the tie points shown in Fig 7.8. There is just one exception.
The type "p" corresponds to a pdf (i.e. a sum of Gaussian mixtures). Pdf's cannot be tied, however,
they can be named in a Tie (TI) command (see below) in which case, the effect is to join all of the
contained mixture components into one pool of mixtures and then all of the mixtures in the pool
are shared across all pdf's. This allows conventional tied-mixture or semi-continuous HMM systems
to be constructed.

The syntax rules for item lists are as follows. An item list consists of a comma separated list of
item sets.

itemList = "{" itemSet { "," itemSet } "}"

Each itemSet consists of the name of one or more HMMs (or a pattern representing a set of HMMs)
followed by a specification which represents a set of paths down the parameter hierarchy each
terminating at one of the required parameter items.

itemSet = hmmName . ["transP" | "state" state ]
hmmName = ident | identList
identList = "(" ident { "," ident } ")"
ident = < char | metachar >
metachar = "?" | "*"

A hmmName consists of a single ident or a comma separated list of ident's. The following examples
are all valid hmmName's:

aa three model001 (aa,iy,ah,uh) (one,two,three)

In addition, an ident can contain the metacharacter "?" which matches any single character and
the metacharacter "*" which matches a string of zero or more characters. For example, the item
list

{*-aa+*.transP}

would represent the set of transition matrices of all loaded triphone variations of aa.
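
The behaviour of the two metacharacters can be sketched in Python by translating a pattern into a regular expression. This is an illustration of the matching rule only, not HHEd's matcher; the function name ident_matches is hypothetical.

```python
import re

def ident_matches(pattern, name):
    """True if name matches pattern, where ? is one char and * is any run."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")            # exactly one character
        elif ch == "*":
            parts.append(".*")           # zero or more characters
        else:
            parts.append(re.escape(ch))  # literal character, e.g. + or -
    return re.fullmatch("".join(parts), name) is not None
```

With this rule the pattern *-aa+* matches a triphone such as b-aa+t but not the monophone aa, because the literal - and + must be present.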
Items within states require the state indices to be specified 

state = index ["." stateComp ] 

index = "[" intRange { "," intRange } "]" 

intRange = integer [ "-" integer ] 

For example, the item list 

{*.state[1,3-5,9]}

represents the set of all states 1, 3 to 5 inclusive and 9 of all currently loaded HMMs. Items within 
states include durational parameters, stream weights, pdf's and all items within mixtures 
stateComp = "dur" | "weights" | [ "stream" index ] "." "mix" [ mix ]

For example, 

{(aa,ah,ax).state[2].dur}

denotes the set of durational parameter vectors from state 2 of the HMMs aa, ah and ax. Similarly,

{*.state[2-4].weights}

denotes the set of stream weights for states 2 to 4 of all currently loaded HMMs. The specification
of pdf's may optionally include a list of the relevant streams, if omitted, stream 1 is assumed. For
example,

{three.state[3].mix}

and

{three.state[3].stream[1].mix}

both denote a list of the single pdf belonging to stream 1 of state 3 of the HMM three.

Within a pdf, the possible item types are mixture components, mean vectors, and the various
possible forms of covariance parameters

mix = index [ "." ( "mean" | "cov" ) ]

For example,

{*.state[2].mix[1-3]}

denotes the set of mixture components 1 to 3 from state 2 of all currently loaded HMMs and

{(one,two).state[4].stream[3].mix[1].mean}

denotes the set of mean vectors from mixture component 1, stream 3, state 4 of the HMMs one
and two. When cov is specified, the type of the covariance item referred to is determined from the
CovKind of the loaded models. Thus, for diagonal covariance models, the item list

{*.state[2-4].mix[1].cov}

would denote the set of variance vectors for mixture 1, states 2 to 4 of all loaded HMMs.

Note finally, that it is not an error to specify non-existent models, states, mixtures, etc. All
item list specifications are regarded as patterns which are matched against the currently loaded set
of models. All and only those items which match are included in the set. However, both a null
result and a set of items of mixed type do result in errors.
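
The index syntax used throughout the item lists above (a bracketed, comma separated list of integers and inclusive ranges) can be expanded with a few lines of Python. The function name parse_index is hypothetical and the sketch does no validation beyond the brackets.

```python
def parse_index(text):
    """Expand an index such as "[1,3-5,9]" into a list of integers."""
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("index must be enclosed in [ ]")
    values = []
    for part in text[1:-1].split(","):
        lo, sep, hi = part.strip().partition("-")
        if sep:
            values.extend(range(int(lo), int(hi) + 1))   # inclusive range
        else:
            values.append(int(lo))
    return values
```

Thus [1,3-5,9] expands to states 1, 3, 4, 5 and 9; indices that do not exist in a given model are simply not matched, as noted above.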

All HHEd commands consist of a 2 character command name followed by zero or more
arguments. In the following descriptions, item lists are shown as itemList(c) where the character c
denotes the type of item expected by that command. If this type indicator is missing then the
command works for all item types.

The HHEd commands are as follows 

AT i j prob itemList(t)

Add a transition from state i to state j with probability prob for all transition matrices in itemList.
The remaining transitions out of state i are rescaled so that $\sum_k a_{ik} = 1$. For example,

AT 1 3 0.1 {*.transP}

would add a skip transition to all loaded models from state 1 to state 3 with probability 0.1.
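
The effect of AT on one transition matrix can be sketched in Python. The proportional rescaling of the remaining transitions is an assumption; the text only states that they are rescaled so that the row sums to one. The function name add_transition is hypothetical.

```python
def add_transition(trans, i, j, prob):
    """Return a copy of trans with a[i][j] = prob and row i renormalised.

    States are numbered from 1 as in the text; trans is a list of rows.
    """
    row = list(trans[i - 1])
    row[j - 1] = 0.0
    rest = sum(row)
    # Scale the other transitions so that the row sums to 1 - prob.
    scale = (1.0 - prob) / rest if rest > 0.0 else 0.0
    row = [a * scale for a in row]
    row[j - 1] = prob
    out = [list(r) for r in trans]
    out[i - 1] = row
    return out
```

Applied to a row (0, 1, 0) with i = 1, j = 3 and prob = 0.1, the row becomes (0, 0.9, 0.1).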

AU hmmList 

Use a set of decision trees to create a new set of models specified by the hmmList. The decision 
trees may be made as a result of either the TB or LT command. 

Each model in hmmList is constructed in the following manner. If a model with the same 
logical name already exists in the current HMM set this is used unchanged, otherwise the model 
is synthesised from the decision trees. If the trees cluster at the model level the synthesis results 
in a logical model sharing the physical model from the tree that matches the new context. If the 
clustering was performed at the state level a prototype model (an example of the same phone model 
occurring in a different context) is found and a new HMM is constructed that shares the transition 
matrix with the prototype model but consists of tied states selected using the decision tree. 

CL hmmList 

Clone a HMM list. The file hmmList should hold a list of HMMs all of whose logical names are 
either the same as, or are context-dependent versions of the currently loaded set of HMMs. For each 
name in hmmList, the corresponding HMM in the loaded set is cloned. On completion, the currently 
loaded set is discarded and replaced by the new set. For example, if the file mylist contained 

A-A+A 
A-A+B 
B-A+A 
B-B+B 
B-B+A 

and the currently loaded HMMs were just A and B, then A would be cloned 3 times to give the models 
A-A+A, A-A+B and B-A+A, and B would be cloned 2 times to give B-B+B and B-B+A. On completion, 
the original definitions for A and B would be deleted (they could be retained by including them in 
the new hmmList). 

CO newList 

Compact a set of HMMs. The effect of this command is to scan the currently loaded set of HMMs 
and identify all identical definitions. The physical name of the first model in each identical set is 
then assigned to all models in that set and all model definitions are replaced by a pointer to the 
first model definition. On completion, a list of HMMs which includes the new model tyings is 
written out to file newList. For example, suppose that models A, B, C and D were currently loaded 
and A and B were identical. Then the command 

CO tlist 

would tie HMMs A and B, set the physical name of B to A and output the new HMM list 

A 
B A 
C 
D 

to the file tlist. This command is used mainly after performing a sequence of parameter tying 
commands. 
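
As a rough illustration of the compaction idea, the following Python sketch maps every logical model to the first loaded model with an identical definition. The names `compact` and `format_hmm_list` are hypothetical, and model definitions are stood in for by simple hashable values; real HMM definitions would need a proper equality test.

```python
def compact(models):
    """Map each logical model name to the first model with the same definition.

    models: dict of logical name -> definition (any hashable stand-in),
    in load order. Returns a list of (logical, physical) name pairs.
    """
    first_seen = {}
    pairs = []
    for name, definition in models.items():
        physical = first_seen.setdefault(definition, name)
        pairs.append((name, physical))
    return pairs


def format_hmm_list(pairs):
    """Write one line per model: the logical name, then the physical name if it differs."""
    return [lo if lo == ph else f"{lo} {ph}" for lo, ph in pairs]


# A and B share a definition, C and D are distinct.
models = {"A": "def1", "B": "def1", "C": "def2", "D": "def3"}
lines = format_hmm_list(compact(models))
print("\n".join(lines))
```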

DP s n id ... 

Duplicates a set of HMMs. This command is used to replicate a set of HMMs whilst allowing control 
over which structures will be shared between them. The first parameter controls duplication of tied 
structures. Any macros whose type appears in string s are duplicated with new names and only 
used in the duplicate model set. The remaining shared structures are common through all the 
model sets (original and duplicates). The second parameter defines the number of times the current 
HMM set should be duplicated with the remaining n parameters providing suffixes to make the 
original macro identifiers unique in each duplicated HMM set. 

For instance the following script could be used to duplicate a set of tied state models to produce 
gender dependent ones with tied variances. 

MM "v_" { (*).state[2-4].mix[1-2].cov } 
DP "v" 2 ":m" ":f" 

The MM command converts all variances into macros (with each macro referring to only one variance). 
The DP command then duplicates the current HMM set twice. Each of the duplicate sets will share 
the tied variances with the original set but will have new mixture means, weights and state macros. 
The new macro names will be constructed by appending the id ":m" or ":f" to the original macro 
name whilst the model names have the id appended after the base phone name (so ax-b+d becomes 
ax-b:m+d or ax-b:f+d). 

FA varscale 

Computes an average within state variance vector for a given HMM set, using statistics generated 
by HERest (see LS for loading stats). The average variance vector is scaled and stored in the HMM 
set; any variance floor vectors present are replaced. Subsequently, the variance floor is applied to 
all variances in the model set. This can be inhibited by setting APPLYVFLOOR to FALSE. 

FC 

Converts all covariances in the modelset to full. This command takes an HMM set with diagonal 
covariances and creates full covariances which are initialised with the variances of the diagonal 
system. The tying structure of the original system is kept intact. 
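
The conversion can be pictured with a short Python sketch. The function name `diagonal_to_full` is hypothetical, and setting the off-diagonal elements to zero is an assumption of the sketch; the text above only states that the full covariances are initialised with the variances of the diagonal system.

```python
def diagonal_to_full(variances):
    """Build a full covariance matrix from a variance vector.

    The diagonal holds the original variances; off-diagonal elements
    are assumed to start at zero.
    """
    n = len(variances)
    return [[variances[r] if r == c else 0.0 for c in range(n)]
            for r in range(n)]


full = diagonal_to_full([0.5, 2.0, 1.5])
for row in full:
    print(row)
```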

FV file 

Loads one variance floor macro per stream from file. The file containing the variance floor macros 
can, for example, be generated by HCompV. Any variance floor vectors present in the model set 
are replaced. Secondly the variance floor is applied to all variances. This can be inhibited by 
setting APPLYVFLOOR to FALSE. 

HK hsetkind 

Converts model set from one kind to another. Although hsetkind can take the value PLAINHS, 
SHAREDHS, TIEDHS or DISCRETEHS, the HK command is most likely to be used when building 
tied-mixture systems (hsetkind=TIEDHS). 

JO size minw 

Set the size and minimum mixture weight for subsequent Tie (TI) commands applied to pdf's. The 
value of size sets the total number of mixtures in the tied mixture set (codebook) and minw sets 
a floor on the mixture weights as a multiple of MINMIX. This command only applies to tying item 
lists of type "p" (see the Tie TI command below). 

LS statsfile 

This command is used to read in the HERest statistics file (see the HERest -s option) stored in 
statsfile. These statistics are needed for certain clustering operations. The statistics file contains 
the occupation count for every HMM state. 

LT treesfile 

This command reads in the decision trees stored in treesfile. The trees file will consist of a set of 
questions defining contexts that may appear in the subsequent trees. The trees are used to identify 
either the state or the model that should be used in a particular context. The file would normally 
be produced by ST after tree based clustering has been performed. 

MD nmix itemlist 

Decrease the number of mixture components in each pdf in the itemList to m. This employs a 
stepwise greedy merging strategy. For a given set of mixture components the pair with minimal 
merging cost is found and merged. This is repeated until only m mixture components are left. Any 
defunct mixture components (i.e. components with a weight below MINMIX) are deleted prior to 
this process. 

Note that after application of this command a pdf in itemlist may consist of fewer, but not 
more than m mixture components. 

As an example, the command 

MD 6 {*-aa+*.state[3].mix} 

would decrease the number of mixture components in state 3 of all triphones of aa to 6. 

MM macro itemList 

This command makes each item (i = 1..N) in itemList into a macro with name macroi and a usage 
of one. This command can prevent unnecessary duplication of structures when HMMs are cloned 
or duplicated. 



MT triList newTriList 

Make a set of triphones by merging the currently loaded set of biphones. This is a very specialised 
command. All currently loaded HMMs must have 3 emitting states and be either left or right 
context-dependent biphones. The list of HMMs stored in triList should contain one or more 
triphones. For each triphone in triList of the form X-Y+Z, there must be currently loaded biphones 
X-Y and Y+Z. A new triphone X-Y+Z is then synthesised by first cloning Y+Z and then replacing the 
state information for the initial emitting state by the state information for the initial emitting 
state of X-Y. Note that the underlying physical names of the biphones used to create the triphones 
are recorded so that where possible, triphones generated from tied biphones are also tied. On 
completion, the new list of triphones including aliases is written to the file newTriList. 
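
The synthesis rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above: `make_triphone` is a hypothetical name, each biphone is represented simply as a list of three emitting-state labels, and the bookkeeping of physical names and tying is left out.

```python
def make_triphone(name, biphones):
    """Synthesise the emitting states of triphone X-Y+Z from biphones X-Y and Y+Z.

    biphones: dict of biphone name -> list of 3 emitting-state descriptions.
    """
    left, rest = name.split("-", 1)
    centre, right = rest.split("+", 1)
    left_bi, right_bi = f"{left}-{centre}", f"{centre}+{right}"
    if left_bi not in biphones or right_bi not in biphones:
        raise KeyError(f"{name} needs loaded biphones {left_bi} and {right_bi}")
    states = list(biphones[right_bi])   # clone Y+Z
    states[0] = biphones[left_bi][0]    # initial emitting state comes from X-Y
    return states


biphones = {
    "x-y": ["xy_s2", "xy_s3", "xy_s4"],
    "y+z": ["yz_s2", "yz_s3", "yz_s4"],
}
triphone_states = make_triphone("x-y+z", biphones)
print(triphone_states)
```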

MU m itemList(p) 

Increase the number of non-defunct mixture components in each pdf in the itemList to m (when 
m is just a number) or by m (when m is a number preceded by a + sign). A defunct mixture is 
one for which the weight has fallen below MINMIX. This command works in two steps. Firstly, the 
weight of each mixture in each pdf is checked. If any defunct mixtures are discovered, then each is 
successively replaced by a non-defunct mixture component until either the required total number of 
non-defunct mixtures is reached or there are no defunct mixtures left. This replacement works by 
first deleting the defunct mixture and then finding the mixture with the largest weight and splitting 
it. The split operation is as follows. The weight of the mixture component is first halved and then 
the mixture is cloned. The two identical mean vectors are then perturbed by adding 0.2 standard 
deviations to one and subtracting the same amount from the other. 

In the second step, the mixture component with the largest weight is split as above. This is 
repeated until the required number of mixture components are obtained. Whenever a mixture is 
split, a count is incremented for that mixture so that splitting occurs evenly across the mixtures. 
Furthermore, a mixture whose gconst value falls more than four standard deviations below the mean 
is not split. 

As an example, the command 

MU 6 {*-aa+*.state[3].mix} 

would increase the number of mixture components in state 3 of all triphones of aa to 6. 
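
A simplified Python sketch of the split operation follows. It covers only the basic step described above (halve the weight, clone, perturb the two means by 0.2 standard deviations in opposite directions) for diagonal covariance components. The function names are hypothetical, and the handling of defunct mixtures, the per-mixture split count and the gconst test are deliberately omitted.

```python
import math


def split_component(weight, mean, var, n_std=0.2):
    """Split one diagonal-covariance component into two perturbed clones."""
    half = weight / 2.0
    offsets = [n_std * math.sqrt(v) for v in var]
    up = [m + o for m, o in zip(mean, offsets)]
    down = [m - o for m, o in zip(mean, offsets)]
    return (half, up, list(var)), (half, down, list(var))


def mix_up(components, target):
    """Repeatedly split the heaviest component until `target` components exist.

    components: list of (weight, mean_vector, variance_vector) tuples.
    """
    comps = list(components)
    while len(comps) < target:
        heaviest = max(range(len(comps)), key=lambda k: comps[k][0])
        a, b = split_component(*comps.pop(heaviest))
        comps += [a, b]
    return comps


# One Gaussian with variance 4 (standard deviation 2) split into two.
two = mix_up([(1.0, [0.0], [4.0])], 2)
print(two)
```

Because each split only halves a weight and duplicates it, the mixture weights continue to sum to one without any renormalisation.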

NC N macro itemList(s) 

N-cluster the states listed in the itemList and tie each cluster i as macro macroi where i is 
1,2,3,...,N. The set of states in the itemList are divided into N clusters using the following furthest 
neighbour hierarchical cluster algorithm: 

    create 1 cluster for each state; 
    n = number of clusters; 
    while (n>N) { 
        find i and j for which g(i,j) is minimum; 
        merge clusters i and j; 
        n = n - 1; 
    } 

Here g(i,j) is the inter-group distance between clusters i and j defined as the maximum distance 
between any state in cluster i and any state in cluster j. The calculation of the inter-state distance 
depends on the type of HMMs involved. Single mixture Gaussians use 

$$ d(i,j) = \left[ \frac{1}{S} \sum_{s=1}^{S} \frac{1}{V_s} \sum_{k=1}^{V_s} \frac{(\mu_{isk} - \mu_{jsk})^2}{\sigma_{isk}\sigma_{jsk}} \right]^{\frac{1}{2}} \qquad (17.1) $$

where $V_s$ is the dimensionality of stream s. Fully tied mixture systems (ie TIEDHS) use 

$$ d(i,j) = \left[ \frac{1}{S} \sum_{s=1}^{S} \frac{1}{M_s} \sum_{m=1}^{M_s} (c_{ism} - c_{jsm})^2 \right]^{\frac{1}{2}} \qquad (17.2) $$

and all others use 

$$ d(i,j) = -\frac{1}{S} \sum_{s=1}^{S} \frac{1}{M_s} \sum_{m=1}^{M_s} \log[b_{js}(\mu_{ism})] + \log[b_{is}(\mu_{jsm})] \qquad (17.3) $$

where $b_{js}(x)$ is as defined in equation 7.1 for the continuous case and equation 7.3 for the discrete 
case. The actual tying of states in each cluster is performed exactly as for the Tie (TI) command 
below. The macro for the i'th tied cluster is called macroi. 
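
The furthest neighbour procedure can be sketched in Python for the simplest case: single stream, single Gaussian states, using the distance of equation 17.1. The function names are hypothetical, each state is represented as a pair (means, variances), and no attempt is made at efficiency or at the tying step itself.

```python
import math


def state_distance(a, b):
    """Equation 17.1 restricted to one stream: a and b are (means, variances)."""
    (mu_a, var_a), (mu_b, var_b) = a, b
    total = sum((x - y) ** 2 / math.sqrt(va * vb)   # sigma_i * sigma_j
                for x, y, va, vb in zip(mu_a, mu_b, var_a, var_b))
    return math.sqrt(total / len(mu_a))


def n_cluster(states, n_target):
    """Furthest neighbour clustering; returns lists of state indices."""
    clusters = [[i] for i in range(len(states))]

    def g(c1, c2):
        # Inter-group distance: the maximum distance between any two members.
        return max(state_distance(states[i], states[j]) for i in c1 for j in c2)

    while len(clusters) > n_target:
        _, p, q = min((g(clusters[p], clusters[q]), p, q)
                      for p in range(len(clusters))
                      for q in range(p + 1, len(clusters)))
        clusters[p] += clusters.pop(q)   # merge the closest pair (q > p)
    return clusters


states = [([0.0], [1.0]), ([0.1], [1.0]), ([5.0], [1.0]), ([5.2], [1.0])]
clusters = n_cluster(states, 2)
print(clusters)
```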

PR 

This command takes a model set that has been estimated with an HLDA transform, but stored as 
a semi-tied transform rather than an input transform, and transforms it into a model-set with the 
projected number of dimensions and an input transform. 

PS nstates power [numiters] 

This command sets the number of Gaussians in each HMM state proportional to a power of the 
number of frames available for training it. The number of frames is obtained from a "stats" file 
output by HERest, which is loaded by the LS command. Typical usage might be: 

LS <statsfile> 
PS 16 0.2 

in order to achieve an average of 16 Gaussians per state with a power of 0.2. 

It is always advisable when increasing the number of Gaussians in states to increase the number 
by small increments and re-estimate HMMs using HERest once or more in between. It may be 
difficult to avoid a large increase in number of Gaussians in particular states when moving from 
a HMM set with a constant number of Gaussians per state to one controlled by a power law. 
Therefore the PS command has a facility for increasing the number of Gaussians gradually where 
the target is larger than the initial number of Gaussians, so that HERest can be run in between. 
In this example, one could use the HHEd command PS 16 0.2 3, run HERest, use the command 
PS 16 0.2 2, run HERest, and then run PS 16 0.2 1 before the final re-estimation with HERest. 
The last argument is the number of iterations remaining. A fairly similar effect could be obtained 
by increasing the power linearly from zero. 
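
The power-law idea can be illustrated with a small Python sketch. The exact rule used by HHEd is not given here, so the following is an assumption: each state's share is its occupancy raised to the given power, the shares are scaled so that the average matches the requested figure, and the result is rounded with a minimum of one Gaussian. The function name `power_law_allocation` is hypothetical, and the gradual (numiters) schedule is not modelled.

```python
def power_law_allocation(occupancies, avg_per_state, power):
    """Allocate Gaussians per state in proportion to occupancy ** power.

    Assumed rule for illustration only: scale so the mean count is roughly
    avg_per_state, then round and keep at least one Gaussian per state.
    """
    weights = [occ ** power for occ in occupancies]
    scale = avg_per_state * len(occupancies) / sum(weights)
    return [max(1, round(w * scale)) for w in weights]


# Three states with very different amounts of training data.
counts = power_law_allocation([100.0, 1000.0, 10000.0], 16, 0.2)
print(counts)
```

With a small power such as 0.2, a hundredfold difference in occupancy produces a much smaller difference in the number of Gaussians, which is the point of using a power law rather than a strictly proportional allocation.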

QS name itemList(h) 

Define a question name which is true for all the models in itemList. These questions can 
subsequently be used as part of the decision tree based clustering procedure (see TB command below). 

RC N identifier [itemlist] 

This command is used to grow a regression class tree for adaptation purposes. A regression class 
tree is grown with N terminal or leaf nodes, using the centroid splitting algorithm with a Euclidean 
distance measure to cluster the model set's mixture components. Hence each leaf node specifies a 
particular mixture component cluster. The regression class tree is saved with the macro identifier 
identifier_N. Each Gaussian component is also labelled with a regression class number 
(corresponding to the leaf node number that the Gaussian component resides in). In order to grow the 
regression class tree it is necessary to load in a statsfile using the LS command. It is also 
possible to specify an itemlist containing the "non-speech" sound components such as the silence 
mixture components. If this is included then the first split made will result in one leaf containing 
the specified non-speech sound components, while the other leaf will contain the rest of the model 
set components. Tree construction then continues as usual. 

RN hmmIdName 

Rename or add the hmm set identifier in the global options macro to hmmIdName. 



RM hmmFile 

Load the hmm from hmmFile and subtract the mean from state 2, mixture 1 of the model from 
every loaded model. Every component of the mean is subtracted including deltas and accelerations. 



RO f [statsfile] 

This command is used to remove outlier states during clustering with subsequent NC or TC 
commands. If statsfile is present it first reads in the HERest statistics file (see LS) otherwise it 
expects a separate LS command to have already been used to read in the statistics. Any subsequent 
NC, TC or TB commands are extended to ensure that the occupancy of the clusters produced exceeds the 
threshold f. For TB this is used to choose which questions are allowed to be used to split each node, 
whereas for NC and TC a final merging pass is used: for as long as the smallest cluster count falls 
below the threshold f, that cluster is merged with its nearest neighbour. 

RT i j itemList(t) 

Remove the transition from state i to j in all transition matrices given in the itemList. After 
removal, the remaining non-zero transition probabilities for state i are rescaled so that $\sum_k a_{ik} = 1$. 

SH 

Show the current HMM set. This command can be inserted into edit scripts for debugging. It 
prints a summary of each loaded HMM identifying any tied parameters. 

SK skind 

Change the sample kind of all loaded HMMs to skind. This command is typically used in 
conjunction with the SW command. For example, to add delta coefficients to a set of models, the SW 
command would be used to double the stream widths and then this command would be used to 
add the _D qualifier. 

SS N 

Split into N independent data streams. This command causes the currently loaded set of HMMs 
to be converted from 1 data stream to N independent data streams. The widths of each stream 
are determined from the single stream vector size and the sample kind as described in section 5.13. 
Execution of this command will cause any tyings associated with the split stream to be undone. 

ST filename 

Save the currently defined questions and trees to file filename. This allows subsequent construction 
of models for new contexts using the LT and AU commands. 

SU N w1 w2 w3 .. wN 

Split into N independent data streams with stream widths as specified. This command is similar 
to the SS command except that the width of each stream is defined explicitly by the user rather 
than using the built-in stream splitting rules. Execution of this command will cause any tyings 
associated with the split stream to be undone. 



SW s n 

Change the width of stream s of all currently loaded HMMs to n. Changing the width of stream 
involves changing the dimensions of all mean and variance vectors or covariance matrices. If n 
is greater than the current width of stream s, then mean vectors are extended with zeroes and 
variance vectors are extended with 1's. Covariance matrices are extended with zeroes everywhere 
except for the diagonal elements which are set to 1. This command preserves any tyings which may 
be in force. 
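
The padding rules for widening a stream can be shown with a brief Python sketch. The function names are hypothetical and only the widening case described above is covered; what happens when n is smaller than the current width is not addressed.

```python
def widen_vectors(mean, var, new_width):
    """Extend a mean vector with zeroes and a variance vector with ones."""
    extra = new_width - len(mean)
    if extra < 0:
        raise ValueError("this sketch only covers widening a stream")
    return mean + [0.0] * extra, var + [1.0] * extra


def widen_covariance(cov, new_width):
    """Extend a full covariance matrix: new diagonal entries 1, all else 0."""
    old = len(cov)
    return [[cov[r][c] if r < old and c < old else (1.0 if r == c else 0.0)
             for c in range(new_width)]
            for r in range(new_width)]


mean, var = widen_vectors([0.3, -1.2], [0.5, 2.0], 4)
cov = widen_covariance([[0.5, 0.1], [0.1, 2.0]], 3)
print(mean, var)
print(cov)
```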
TB f macro itemList(s or h) 

Decision tree cluster all states in the given itemList and tie them as macroi where i is 1,2,3,.... 
This command performs a top down clustering of the states or models appearing in itemlist. 
This clustering starts by placing all items in a single root node and then choosing a question from 
the current set to split the node in such a way as to maximise the likelihood of a single diagonal 
covariance Gaussian at each of the child nodes generating the training data. This splitting continues 
until the increase in likelihood falls below threshold f or no questions are available which do not pass 
the outlier threshold test. This type of clustering is only implemented for single mixture, diagonal 
covariance untied models. 



TC f macro itemList(s) 

Cluster all states in the given itemList and tie them as macroi where i is 1,2,3,.... This command 
is identical to the NC command described above except that the number of clusters is varied such 
that the maximum within cluster distance is less than the value given by f. 

TI macro itemList 

Tie the items in itemList and assign them to the specified macro name. This command applies to 
any item type but all of the items in itemList must be of the same type. The detailed method of 
tying depends on the item type as follows: 

state(s) the state with the largest total value of gConst in stream 1 (indicating broad variances) 
and the minimum number of defunct mixture weights (see MU command) is selected from the 
item list and all states are tied to this typical state. 

transitions(t) all transition matrices in the item list are tied to the last in the list. 

mixture(m) all mixture components in the item list are tied to the last in the list. 

mean(u) the average vector of all the mean vectors in the item list is calculated and all the means 
are tied to this average vector. 

variance(v) a vector is constructed for which each element is the maximum of the corresponding 
elements from the set of variance vectors to be tied. All of the variances are then tied to this 
maximum vector. 

xform(x) all transform matrices in the item list are tied to the last in the list. 

duration(d) all duration vectors in the item list are tied to the last in the list. 

stream weights(w) all stream weight vectors in the item list are tied to the last in the list. 

pdf(p) as noted earlier, pdf's are tied to create tied mixture sets rather than to create a shared 
pdf. The procedure for tying pdf's is as follows 

1. All mixtures from all pdf's in the item list are collected together in order of mixture 
weight. 

2. If the number of mixtures exceeds the join size J [see the Join (JO) command above], 
then all but the first J mixtures are discarded. 

3. If the number of mixtures is less than J, then the mixture with the largest weight is 
repeatedly split until there are exactly J mixture components. The split procedure used 
is the same as for the MixUp (MU) command described above. 

4. All pdf's in the item list are made to share all J mixture components. The weight for 
each mixture is set proportional to the log likelihood of the mean vector of that mixture 
with respect to the original pdf. 

5. Finally, all mixture weights below the floor set by the Join command are raised to the 
floor value and all of the mixture weights are renormalised. 

covariance(i) all covariance matrices in the item list are tied to the last in the list. 
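
Two of the tying rules above involve a small calculation rather than simply picking one item: means are tied to their average and variances to their element-wise maximum. A minimal Python sketch of just these two rules follows; the function names are hypothetical and vectors are plain lists of floats.

```python
def tie_means(means):
    """Average vector of all mean vectors to be tied."""
    n = len(means)
    return [sum(column) / n for column in zip(*means)]


def tie_variances(variances):
    """Element-wise maximum of all variance vectors to be tied."""
    return [max(column) for column in zip(*variances)]


shared_mean = tie_means([[1.0, 2.0], [3.0, 6.0]])
shared_var = tie_variances([[1.0, 5.0], [2.0, 3.0]])
print(shared_mean, shared_var)
```

Taking the maximum for variances means the shared vector is at least as broad as every vector it replaces.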
TR n 

Change the level of detail for tracing. The level consists of a number of separate flags which can be added 
together. Values 0001, 0002, 0004, 0008 have the same meaning as the command line trace level but 
apply only to a single block of commands (a block consisting of a set of commands of the same name). 
A value of 0010 can be used to show current memory usage. 



UT itemList 

Untie all items in itemList. For each item in the item list, if the usage counter for that item is 
greater than 1 then it is cloned, the original shared item is replaced by the cloned copy and the 
usage count of the shared item is reduced by 1. If the usage count is already 1, the associated 
macro is simply deleted and the usage count set to 0 to indicate an unshared item. Note that it is 
not possible to untie a pdf since these are not actually shared [see the Tie (TI) command above]. 

XF filename 

Sets the input transform of the model-set to be filename. 

17.8.2 Use 

HHEd is invoked by typing the command line 

HHEd [options] edCmdFile hmmList 

where edCmdFile is a text file containing a sequence of edit commands as described above and 
hmmList defines the set of HMMs to be edited (see HModel for the format of HMM list). If 
the models are to be kept in separate files rather than being stored in an MMF, the configuration 
variable KEEPDISTINCT should be set to true. The available options for HHEd are 

-d dir This option tells HHEd to look in the directory dir to find the model definitions. 

-o ext This causes the file name extensions of the original models (if any) to be replaced by ext. 

-w mmf Save all the macros and model definitions in a single master macro file mmf. 

-x s Set the extension for the edited output files to be s (default is to use the original names 
unchanged). 

-z Setting this option causes all aliases in the loaded HMM set to be deleted (zapped) immediately 
before loading the definitions. The result is that all logical names are ignored and the actual 
HMM list consists of just the physically distinct HMMs. 

-B Output HMM definition files in binary format. 

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs. 

-M dir Store output HMM macro model files in the directory dir. If this option is not given, the 
new HMM definition will overwrite the existing one. 

-Q Print a summary of all commands supported by this tool. 

HHEd also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
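
When HHEd is driven from a script, it can help to assemble the argument list programmatically. The Python sketch below only builds the list and does not run anything. The helper name `hhed_command` and all file names are hypothetical; the options used (-H, -M, -B) are those listed above.

```python
def hhed_command(edit_script, hmm_list, mmfs=(), out_dir=None, binary=False):
    """Build an argument list of the form: HHEd [options] edCmdFile hmmList."""
    cmd = ["HHEd"]
    for mmf in mmfs:             # -H may be repeated to load several MMFs
        cmd += ["-H", mmf]
    if out_dir is not None:      # -M: directory for the output model files
        cmd += ["-M", out_dir]
    if binary:                   # -B: binary output format
        cmd.append("-B")
    cmd += [edit_script, hmm_list]
    return cmd


# Hypothetical file names, for illustration only.
cmd = hhed_command("edit.hed", "hmmlist", mmfs=["models.mmf"], out_dir="out")
print(" ".join(cmd))
```

A list built this way could be handed to `subprocess.run` on a system where HTK is installed.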

17.8.3 Tracing 

HHEd supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting. 

00002 intermediate progress reporting. 

00004 detailed progress reporting. 

00010 show item lists used for each command. 

00020 show memory usage. 

00100 show changes to macro definitions. 

00200 show changes to stream widths. 

00400 show clusters. 

00800 show questions. 

01000 show tree filtering. 

02000 show tree splitting. 

04000 show tree merging. 

10000 show good question scores. 

20000 show all question scores. 

40000 show all merge scores. 

Trace flags are set using the -T option or the TRACE configuration variable. 

17.9 HInit 
17.9.1 Function 

HInit is used to provide initial estimates for the parameters of a single HMM using a set of 
observation sequences. It works by repeatedly using Viterbi alignment to segment the training observations 
and then recomputing the parameters by pooling the vectors in each segment. For mixture 
Gaussians, each vector in each segment is aligned with the component with the highest likelihood. Each 
cluster of vectors then determines the parameters of the associated mixture component. In the 
absence of an initial model, the process is started by performing a uniform segmentation of each 
training observation and for mixture Gaussians, the vectors in each uniform segment are clustered 
using a modified K-Means algorithm. 
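
The uniform segmentation used to start the process can be sketched in Python for the single Gaussian, diagonal covariance case. This is an illustration under simplifying assumptions rather than HInit's algorithm: the function names are hypothetical, every observation sequence is assumed to have at least as many frames as there are emitting states, and neither the K-Means step for mixtures nor the later Viterbi iterations are shown.

```python
def uniform_segments(n_frames, n_states):
    """Split frame indices 0..n_frames into n_states nearly equal ranges."""
    bounds = [i * n_frames // n_states for i in range(n_states + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n_states)]


def initial_estimates(observations, n_states):
    """Pool the vectors assigned to each state and return (mean, variance) pairs.

    observations: list of sequences, each a list of feature vectors.
    """
    pooled = [[] for _ in range(n_states)]
    for obs in observations:
        for s, (start, end) in enumerate(uniform_segments(len(obs), n_states)):
            pooled[s].extend(obs[start:end])
    estimates = []
    for vectors in pooled:
        n = len(vectors)
        columns = list(zip(*vectors))
        mean = [sum(col) / n for col in columns]
        var = [sum((x - m) ** 2 for x in col) / n
               for col, m in zip(columns, mean)]
        estimates.append((mean, var))
    return estimates


# One 4-frame sequence of 1-dimensional vectors, two emitting states.
estimates = initial_estimates([[[0.0], [2.0], [10.0], [12.0]]], 2)
print(estimates)
```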

HInit can be used to provide initial estimates of whole word models in which case the observation 
sequences are realisations of the corresponding vocabulary word. Alternatively, HInit can be used 
to generate initial estimates of seed HMMs for phoneme-based speech recognition. In this latter 
case, the observation sequences will consist of segments of continuously spoken training material. 
HInit will cut these out of the training data automatically by simply giving it a segment label. 

In both of the above applications, HInit normally takes as input a prototype HMM definition 
which defines the required HMM topology i.e. it has the form of the required HMM except that 
means, variances and mixture weights are ignored. The transition matrix of the prototype specifies 
both the allowed transitions and their initial probabilities. Transitions which are assigned zero 
probability will remain zero and hence denote non-allowed transitions. HInit estimates transition 
probabilities by counting the number of times each state is visited during the alignment process. 

HInit supports multiple mixtures, multiple streams, parameter tying within a single model, full 
or diagonal covariance matrices, tied-mixture models and discrete models. The output of HInit is 
typically input to HRest. 

Like all re-estimation tools, HInit allows a floor to be set on each individual variance by defining 
a variance floor macro for each data stream (see chapter 8). 

17.9.2 Use 

HInit is invoked via the command line 

HInit [options] hmm trainFiles ... 

This causes the means and variances of the given hmm to be estimated repeatedly using the data 
in trainFiles until either a maximum iteration limit is reached or the estimation converges. The 
HMM definition can be contained within one or more macro files loaded via the standard -H option. 
Otherwise, the definition will be read from a file called hmm. The list of train files can be stored in 
a script file if required. 

The detailed operation of HInit is controlled by the following command line options 

-e f This sets the convergence factor to the real value f. The convergence factor is the relative 
change between successive values of $P_{max}(O|\lambda)$ computed as a by-product of the Viterbi 
alignment stage (default value 0.0001). 

-i N This sets the maximum number of estimation cycles to N (default value 20). 

-l s The string s must be the name of a segment label. When this option is used, HInit searches 
through all of the training files and cuts out all segments with the given label. When this 
option is not used, HInit assumes that each training file is a single token. 

-m N This sets the minimum number of training examples so that if fewer than N examples are 
supplied an error is reported (default value 3). 

-n This flag suppresses the initial uniform segmentation performed by HInit allowing it to be 
used to update the parameters of an existing model. 

Footnote (modified K-Means algorithm): This algorithm is significantly different from earlier versions of HTK where K-means clustering was used at every 
iteration and the Viterbi alignment was limited to states. 

Footnote (prototype HMM definition): Prototypes should either have GConst set (the value does not matter) to avoid HTK trying to compute it or 
variances should be set to a positive value such as 1.0 to ensure that GConst is computable. 

-B Output HMM definition files in binary format. 



-o s The string s is used as the name of the output HMM in place of the source name. This is 
provided in HInit since it is often used to initialise a model from a prototype input definition. 
The default is to use the source name. 

-u flags By default, HInit updates all of the HMM parameters, that is, means, variances, mixture 
weights and transition probabilities. This option causes just the parameters indicated by the 
flags argument to be updated; this argument is a string containing one or more of the letters 
m (mean), v (variance), t (transition) and w (mixture weight). The presence of a letter enables 
the updating of the corresponding parameter set. 

-v f This sets the minimum variance (i.e. diagonal element of the covariance matrix) to the real 
value f. The default value is 0.0. 

-w f Any mixture weight or discrete observation probability which falls below the global constant 
MINMIX is treated as being zero. When this parameter is set, all mixture weights are floored 
to f * MINMIX. 

-F fmt Set the source data format to fmt. 

-G fmt Set the label file format to fmt. 

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs. 

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs. 

-L dir Search directory dir for label files (default is to search current directory). 

-M dir Store output HMM macro model files in the directory dir. If this option is not given, the 
new HMM definition will overwrite the existing one. 

-X ext Set label file extension to ext (default is lab). 

HInit also supports the standard options -A, -C, -D, -T, and -V as described in section 4.4. 
17.9.3 Tracing 

HInit supports the following trace options where each trace flag is given using an octal base 

000001 basic progress reporting. 

000002 file loading information. 

000004 segments within each file. 

000010 uniform segmentation. 

000020 Viterbi alignment. 

000040 state alignment. 

000100 mixture component alignment. 

000200 count updating. 

000400 output probabilities. 

Trace flags are set using the -T option or the TRACE configuration variable. 
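
Since each trace flag occupies its own bit, several flags can be combined by adding (or OR-ing) their octal values. The short Python sketch below shows the arithmetic only. The constant and function names are hypothetical, and the zero-padded octal string it prints simply mirrors the layout of the table above.

```python
# Octal values copied from the HInit trace table above.
BASIC_PROGRESS = 0o000001
VITERBI_ALIGNMENT = 0o000020
MIXTURE_ALIGNMENT = 0o000100


def combine_trace_flags(*flags):
    """OR the given flags together and return the result in octal digits."""
    value = 0
    for flag in flags:
        value |= flag
    return format(value, "06o")


trace = combine_trace_flags(BASIC_PROGRESS, VITERBI_ALIGNMENT, MIXTURE_ALIGNMENT)
print(trace)
```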
17.10 HLEd 
17.10.1 Function 

This program is a simple editor for manipulating label files. Typical examples of its use might be to 
merge a sequence of labels into a single composite label or to expand a set of labels into a context 
sensitive set. HLEd works by reading in a list of editing commands from an edit script file and then 
makes an edited copy of one or more label files. For multiple level files, edit commands are applied 
to the current level which is initially the first (i.e. 1). Other levels may be edited by moving to the 
required level using the ML Move Level command. 

Each edit command in the script file must be on a separate line. The first two-letter mnemonic on 
each line is the command name and the remaining letters denote labels*. The commands supported 
may be divided into two sets. Those in the first set are used to edit individual labels and they are 
as follows 



CH X A Y B Change Y in the context of A_B to X. A and/or B may be a * to match any context, 
otherwise they must be defined by a DC command (see below). A block of consecutive CH 
commands are effectively executed in parallel so that the contexts are those that exist before 
any of the commands in the block are applied. 

DC A B C .. Define the context A as the set of labels B, C, etc. 

DE A B .. Delete any occurrences of labels A or B etc. 

FI A Y B Find Y in the context of A_B and count the number of occurrences. 

ME X A B .. Merge any sequence of labels A B C etc. and call the new segment X. 

ML N Move to label level N. 

RE X A B .. Replace all occurrences of labels A or B etc. by the label X. 

The commands in the second set perform global operations on whole transcriptions. They are 
as follows. 

DL [N] Delete all labels in the current level. If the optional integer arg is given, then level N 
is deleted. 

EX Expand all labels either from words to phones using the first pronunciation from 
a dictionary when it is specified on the command line, otherwise expand labels of the form 
A_B_C_D_... into a sequence of separate labels A B C D .... This is useful for label formats 
which include a complete orthography as a single label or for creating a set of sub-word labels 
from a word orthography for a sub-word based recogniser. When a label is expanded in this 
way, the label duration is divided into equal length segments. This can only be performed on 
the root level of a multiple level file. 

FG X Mark all unlabelled segments of the input file of duration greater than Tg msecs with 
the label X. The default value for Tg is 50000.0 (=5msecs) but this can be changed using the 
-g command line option. This command is mainly used for explicitly labelling inter-word 
silences in data files for which only the actual speech has been transcribed. 

IS A B Insert label A at the start of every transcription and B at the end. This command is 
usually used to insert silence labels. 

IT Ignore triphone contexts in CH and FI commands. 

LC [X] Convert all phoneme labels to left context dependent. If X is given then the first 
phoneme label a becomes X-a otherwise it is left unchanged. 

NB X The label X (typically a short pause) should be ignored at word boundaries when 
using the context commands LC, RC and TC. 

*In earlier versions of HTK, HLEd command names consisted of a single letter. These are still supported for 
backwards compatibility and they are included in the command summary produced using the -Q option. However, 
commands introduced since version 2.0 have two letter names. 
RC [X] Convert all phoneme labels to right context dependent. If X is given then the last
phoneme label z becomes z+X otherwise it is left unchanged.

SB X Define the label X to be a sentence boundary marker. This label can then be used in
context-sensitive change commands.

SO Sort all labels into time order.

SP Split multiple levels into multiple alternative label lists.

TC [X[Y]] Convert all phoneme labels to triphones, that is left and right context dependent. If
X is given then the first phoneme label a becomes X-a+b otherwise it is left unchanged. If Y
is given then the last phoneme label z becomes y-z+Y otherwise if X is given then it becomes
y-z+X otherwise it is left unchanged.

WB X Define X to be an inter-word label. This command affects the operation of the LC, RC
and TC commands. The expansion of context labels is blocked wherever an inter-word label
occurs.
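
As an illustration of two of the global commands described above, the following Python sketch mimics the behaviour described for EX (splitting a label of the form A_B_C into equal-length segments) and FG (labelling gaps longer than a threshold). It is not HLEd's implementation: the function names, the tuple representation of a segment and the restriction of FG to gaps between consecutive labels are all assumptions made for the example. Times are taken to be in HTK units of 100 ns, so that 50000.0 corresponds to 5 msecs.

```python
from typing import List, Tuple

# A segment is (start, end, label); times are assumed to be in 100 ns units.
Segment = Tuple[float, float, str]


def expand_label(seg: Segment, sep: str = "_") -> List[Segment]:
    """EX-style expansion: split A_B_C into equal-length segments A, B, C."""
    start, end, label = seg
    parts = [p for p in label.split(sep) if p]
    if len(parts) <= 1:
        return [seg]  # nothing to expand
    step = (end - start) / len(parts)
    return [(start + k * step, start + (k + 1) * step, p)
            for k, p in enumerate(parts)]


def fill_gaps(segs: List[Segment], label: str,
              min_gap: float = 50000.0) -> List[Segment]:
    """FG-style gap marking: label every gap between segments longer than min_gap."""
    out: List[Segment] = []
    prev_end = None
    for seg in sorted(segs):
        if prev_end is not None and seg[0] - prev_end > min_gap:
            out.append((prev_end, seg[0], label))  # hypothetical filler segment
        out.append(seg)
        prev_end = seg[1] if prev_end is None else max(prev_end, seg[1])
    return out
```

For example, `expand_label((0.0, 300.0, "A_B_C"))` yields three segments of length 100.0, and `fill_gaps` leaves gaps at or below the threshold unlabelled, matching the behaviour described for the -g option.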



The source and target label file formats can be defined using the -G and -P command line
arguments. They can also be set using the configuration variables SOURCELABEL and TARGETLABEL.
The default for both cases is the HTK format.

17.10.2 Use

HLEd is invoked by typing the command 

HLEd [options] edCmdFile labFiles 

This causes HLEd to be applied to each labFile in turn using the edit commands listed in
edCmdFile. The labFiles may be master label files. The available options are

-b Suppress label boundary times in output files. 

-d s Read a dictionary from file s and use this for expanding labels when the EX command is used.

-i mlf This specifies that the output transcriptions are written to the master label file mlf.

-g t Set the minimum gap detected by the FG command to be t (default 50000.0 = 5 msecs). All gaps of
shorter duration than t are ignored and not labelled.

-l s Directory to store output label files (default is current directory). When output is directed
to an MLF, this option can be used to add a path to each output file name. In particular,
setting the option -l '*' will cause a label file named xxx to be prefixed by the pattern
"*/xxx" in the output MLF file. This is useful for generating MLFs which are independent
of the location of the corresponding data files.

-m Strip all labels to monophones on loading.

-n fn This option causes a list of all new label names created to be output to the file fn.

-G fmt Set the label file format to fmt.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-P fmt Set the target label format to fmt. 

-X ext Set label file extension to ext (default is lab). 

HLEd also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
17.10.3 Tracing



HLEd supports the following trace options where each trace flag is given using an octal base 

000001 basic progress reporting.

000002 edit script details.

000004 general command operation.

000010 change operations.

000020 level split/merge operations.

000040 delete level operation.

000100 edit file input.

000200 memory usage.

000400 dictionary expansion in EX command.

Trace flags are set using the -T option or the TRACE configuration variable.
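
Because each trace flag occupies its own bit, several kinds of tracing can be requested at once by OR-ing the values together. The short Python sketch below shows the arithmetic, assuming the HLEd flag values listed above; the helper names are hypothetical, and how the resulting number must be written on the command line (for example, whether a leading zero is needed) is not specified here.

```python
# HLEd trace flags as octal literals, keyed by bit value.
HLED_TRACE_FLAGS = {
    0o000001: "basic progress reporting",
    0o000002: "edit script details",
    0o000004: "general command operation",
    0o000010: "change operations",
    0o000020: "level split/merge operations",
    0o000040: "delete level operation",
    0o000100: "edit file input",
    0o000200: "memory usage",
    0o000400: "dictionary expansion in EX command",
}


def combine(*flags: int) -> str:
    """OR the given flags together and return the value as octal digits."""
    value = 0
    for flag in flags:
        value |= flag
    return format(value, "o")


def describe(octal_text: str) -> list:
    """List the trace categories enabled by a value written in octal."""
    value = int(octal_text, 8)
    return [name for bit, name in HLED_TRACE_FLAGS.items() if value & bit]
```

Here `combine(0o1, 0o10)` gives the octal digits `11`, which `describe` maps back to basic progress reporting plus change operations.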
17.11 HList

17.11.1 Function 

This program will list the contents of one or more data sources in any HTK supported format. It 
uses the full HTK speech input facilities described in chapter 5 and it can thus read data from a 
waveform file, from a parameter file and direct from an audio source. HList provides a dual role 
in HTK. Firstly, it is used for examining the contents of speech data files. For this function, the 
TARGETKIND configuration variable should not be set since no conversions of the data are required. 
Secondly, it is used for checking that input conversions are being performed properly. In the latter
case, a configuration designed for a recognition system can be used with HList to make sure that
the translation from the source data into the required observation structure is exactly as intended.
To assist this, options are provided to split the input data into separate data streams (-n) and to
explicitly list the identity of each parameter in an observation (-o).

17.11.2 Use 

HList is invoked by typing the command line

HList [options] file . . . 

This causes the contents of each file to be listed to the standard output. If no files are given
and the source format is HAUDIO, then the audio source is listed. The source form of the data can
be converted and listed in a variety of target forms by appropriate settings of the configuration
variables, in particular TARGETKIND^.

The allowable options to HList are

-d Force each observation to be listed as discrete VQ symbols. For this to be possible the
source must be either DISCRETE or have an associated VQ table specified via the VQTABLE
configuration variable.

-e N End listing samples at sample index N.

-h Print the source header information.

-i N Print N items on each line.

-n N Display the data split into N independent data streams.

-o Show the observation structure. This identifies the role of each item in each sample vector.

-p Playback the audio. When sourcing from an audio device, this option enables the playback
buffer so that after displaying the sampled data, the captured audio is replayed.

-r Print the raw data only. This is useful for exporting a file into a program which can only
accept simple character format data.

-s N Start listing samples from sample index N. The first sample index is 0.

-t Print the target header information.

-F fmt Set the source data format to fmt.

HList also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.



17.11.3 Tracing 

HList supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting.

Trace flags are set using the -T option or the TRACE configuration variable.



^The TARGETKIND is equivalent to the HCOERCE environment variable used in earlier versions of HTK 
17.12 HLMCopy

17.12.1 Function 

The basic function of this tool is to copy language models. During this operation the target model 
can be optionally adjusted to a specific vocabulary, reduced in size by applying pruning parameters 
to the different n-gram components and written out in a different file format. Previously unseen 
words can be added to the language model with unigram entries supplied in a unigram probability 
file. At the same time, the tool can be used to extract word pronunciations from a number of source 
dictionaries and output a target dictionary for a specified word list. HLMCopy is a key utility
enabling the user to construct custom dictionaries and language models tailored to a particular
recognition task. 

17.12.2 Use

HLMCopy is invoked by the command line

HLMCopy [options] inLMFile outLMFile

This copies the language model inLMFile to outLMFile optionally applying editing operations
controlled by the following options.

-c n c Set the pruning threshold for n-grams to c. Pruning can be applied to the bigram and
higher components of a model (n>1). The pruning procedure will keep only n-grams which
have been observed more than c times. Note that this option is only applicable to count-based
language models.

-d f Use dictionary f as a source of pronunciations for the output dictionary. A set of dictionaries
can be specified, in order of priority, with multiple -d options.

-f s Set the output language model format to s. Possible options are TEXT for the standard
ARPA-MIT LM format, BIN for Entropic binary format and ULTRA for Entropic ultra format.

-n n Save target model as n-gram.

-m Allow multiple identical pronunciations for a single word. Normally identical pronunciations
are deleted. This option may be required when a single word/pronunciation has several
different output symbols.

-o Allow pronunciations for a single word to be selected from multiple dictionaries. Normally the
dictionaries are prioritised by the order they appear on the command line with only entries
in the first dictionary containing a pronunciation for a particular word being copied to the
output dictionary.

-u f Use unigrams from file f as replacements for the ones in the language model itself. Any
words appearing in the output language model which have entries in the unigram file (which
is formatted as LOG10PROB WORD) use the likelihood (log10(prob)) from the unigram file
rather than from the language model. This allows simple language model adaptation as well
as allowing unigram probabilities to be assigned to words in the output vocabulary that do not
appear in the input language model. In some instances you may wish to use LNorm to
renormalise the model after using -u.

-v f Write a dictionary covering the output vocabulary to file f. If any required words cannot be
found in the set of input dictionaries an error will be generated.



-w f Read a word-list defining the output vocabulary from f . This will be used to select the 
vocabulary for both the output language model and output dictionary. 

HLMCopy also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
17.12.3 Tracing 

HLMCopy supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting.

Trace flags are set using the -T option or the TRACE configuration variable.
17.13 HLRescore
17.13.1 Function 

HLRescore is a general lattice post-processing tool. It reads lattices (for example produced by 
HVite) and applies one of the following operations on them: 

• finding 1-best path through lattice 

• pruning lattice using forward-backward scores 

• expanding lattices with new language model

• converting lattices to equivalent word networks

• calculating various lattice statistics

• converting word MLF files to lattices with a language model

A typical scenario for the use of HLRescore is the application of a higher order n-gram to the
word lattices generated with HVite and a bigram. This would involve the following steps:

• lattice generation with HVite using a bigram

• lattice pruning with HLRescore (-t)

• expansion of lattices using a trigram (-n)

• finding 1-best transcription in the expanded lattice (-f)

Another common use of HLRescore is the tuning of the language model scaling factor and the
word insertion penalty for use in recognition. Instead of having to re-run a decoder many times
with different parameter settings, the decoder is run once to generate lattices. HLRescore can
be used to find the best transcription for a given parameter setting very quickly. These different
transcriptions can then be scored (using HResults) and the parameter setting that yields the
lowest word error rate can be selected.
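
The tuning loop described above can be sketched as a small Python helper that builds one HLRescore command line per parameter setting. This is only an illustration under stated assumptions: the helper name, the output directory and the MLF naming scheme are hypothetical, and only the options -f, -s, -p and -i documented in this section are used. The commands are returned as argument lists and are not executed.

```python
from itertools import product
from typing import List, Sequence


def rescore_commands(vocab: str, lattices: Sequence[str],
                     scales: Sequence[float], penalties: Sequence[float],
                     out_dir: str = "tune") -> List[List[str]]:
    """Build one HLRescore 1-best command per (scale, penalty) pair."""
    commands = []
    for scale, penalty in product(scales, penalties):
        # Hypothetical naming scheme for the output master label file.
        mlf = f"{out_dir}/s{scale}_p{penalty}.mlf"
        commands.append(["HLRescore", "-f",
                         "-s", str(scale),     # grammar scale factor
                         "-p", str(penalty),   # word insertion log probability
                         "-i", mlf,            # output transcriptions
                         vocab, *lattices])
    return commands
```

Each resulting MLF would then be scored separately, and the setting with the lowest word error rate kept.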

Lattices produced by standard HTK decoders, for example, HVite and HDecode, may still
contain duplicate word paths corresponding to different phonetic contexts caused by pronunciation
variants or optional between word short pause silence models. These duplicate lattice nodes and
arcs must be merged to ensure that the finite state grammars created from the lattices by HTK
decoders are deterministic, and therefore usable for recognition. This function is also supported by
HLRescore.

17.13.2 Use

HLRescore is invoked via the command line

HLRescore [options] vocabFile LatFiles 

HLRescore reads each of the lattice files and performs the requested operation(s) on them. At
least one of the following operations must be selected: find 1-best (-f), write lattices (-w), calculate
statistics (-c).

The detailed operation of HLRescore is controlled by the following command line options

-i mlf Output transcriptions to master file mlf.

-l s Directory in which to store label/lattice files.

-m s Direction of merging duplicate nodes and arcs of lattices. The default value is b, indicating 
a merging in a backward direction starting from the sentence end node of the lattice will be 
performed. If using direction f , then the forward merging will be performed instead. 

-n lm Load ARPA-format n-gram language model from file lm and expand lattice with this LM.
All acoustic scores are unchanged but the LM scores are replaced and lattice nodes (i.e.
contexts) are expanded as required by the structure of the LM.

-wn lm Load Resource Management format word pair language model from file lm and apply this
LM to a lattice converted from a word MLF file.

-o s Choose how the output labels should be formatted. s is a string with certain letters (from
NSCTWM) indicating binary flags that control formatting options. N normalize acoustic scores
by dividing by the duration (in frames) of the segment. S remove scores from output label.
By default scores will be set to the total likelihood of the segment. C Set the transcription
labels to start and end on frame centres. By default start times are set to the start time
of the frame and end times are set to the end time of the frame. T Do not include times
in output label files. W Do not include words in output label files when performing state or
model alignment. M Do not include model names in output label files.

-t f [a] Perform lattice pruning after reading lattices with beamwidth f. If the second argument is
given, lower the beam to limit arcs per second to a.

-u f Perform lattice pruning before writing output lattices. Otherwise like -t.

-p f Set the word insertion log probability to f (default 0.0).

-a f Set the acoustic model scale factor to f (default value 1.0).

-r f Set the dictionary pronunciation probability scale factor to f (default value 1.0).

-s f Set the grammar scale factor to f. This factor post-multiplies the language model likelihoods
from the word lattices (default value 1.0).

-d Take pronunciation probabilities from the dictionary instead of from the lattice.

-c Calculate and output lattice statistics.

-f Find 1-best transcription (path) in lattice.

-w Write output lattice after processing.

-q s Choose how the output lattice should be formatted. s is a string with certain letters (from
ABtvaldmn) indicating binary flags that control formatting options. A attach word labels to
arcs rather than nodes. B output lattices in binary for speed. t output node times. v output
pronunciation information. a output acoustic likelihoods. l output language model likelihoods.
d output word alignments (if available). m output within word alignment durations.
n output within word alignment likelihoods.

-y ext This sets the extension for output label files to ext (default rec).

-F fmt Set the source data format to fmt.

-G fmt Set the label file format to fmt.

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-J dir [ext] Add directory to the list of possible input transform directories. Only one of the
options can specify the extension to use for the input transforms.

-K dir [ext] Output transform directory and optional extension for output transforms. The
default option is that there is no output extension and the current transform directory is used.

-P fmt Set the target label format to fmt. 



HLRescore also supports the standard options -A, -C, -D, -S, -T, and -V as described in sec- 
tion 4.4. 
17.13.3 Tracing



HLRescore supports the following trace options where each trace flag is given using an octal base 

0001 enable basic progress reporting. 

0002 output generated transcriptions. 
0004 show details of lattice I/O 
0010 show memory usage after each lattice

Trace flags are set using the -T option or the TRACE configuration variable.
17.14 HLStats
17.14.1 Function 

This program will read in a HMM list and a set of HTK format transcriptions (label files) . It will 
then compute various statistics which are intended to assist in analysing acoustic training data and 
generating simple language models for recognition. The specific functions provided by HLStats 
are: 

1. number of occurrences of each distinct logical HMM and/or each distinct physical HMM. The
list printed can be limited to the N most infrequent models.

2. minimum, maximum and average durations of each logical HMM in the transcriptions.

3. a matrix of bigram pi^obabilities 

4. an ARPA/MIT-LL format text file of back-off bigram probabilities

5. a list of labels which cover the given set of transcriptions.

17.14.2 Bigram Generation

When using the bigram generating options, each transcription is assumed to have a unique entry
and exit label which by default are !ENTER and !EXIT. If these labels are not present they are
inserted. In addition, any label occurring in a transcription which is not listed in the HMM list is
mapped to a unique label called !NULL.

HLStats processes all input transcriptions and maps all labels to a set of unique integers in
the range 1 to L, where L is the number of distinct labels. For each adjacent pair of labels i and
j, it counts the total number of occurrences N(i,j). Let the total number of occurrences of label i
be $N(i) = \sum_{j=1}^{L} N(i,j)$.

For matrix bigrams, the bigram probability p(i,j) is given by

$$p(i,j) = \begin{cases} \alpha N(i,j)/N(i) & \text{if } N(i) > 0 \\ 1/L & \text{if } N(i) = 0 \\ f & \text{otherwise} \end{cases}$$

where f is a floor probability set by the -f option and $\alpha$ is chosen to ensure that $\sum_{j=1}^{L} p(i,j) = 1$.

For back-off bigrams, the unigram probabilities p(i) are given by

$$p(i) = \begin{cases} N(i)/N & \text{if } N(i) > u \\ u/N & \text{otherwise} \end{cases}$$

where u is a unigram floor count set by the -u option and $N = \sum_{i=1}^{L} \max[N(i), u]$.

The backed-off bigram probabilities are given by

$$p(i,j) = \begin{cases} (N(i,j) - D)/N(i) & \text{if } N(i,j) > t \\ b(i)\,p(j) & \text{otherwise} \end{cases}$$

where D is a discount and t is a bigram count threshold set by the -t option. The discount D is
fixed at 0.5 but can be changed via the configuration variable DISCOUNT. The back-off weight b(i)
is calculated to ensure that $\sum_{j=1}^{L} p(i,j) = 1$, i.e.

$$b(i) = \frac{1 - \sum_{j \in B} p(i,j)}{1 - \sum_{j \in B} p(j)}$$

where B is the set of all words for which p(i,j) has a bigram.
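
The back-off bigram formulas above can be followed step by step in a few lines of Python. The sketch below is not the HLStats implementation: the function name, the dictionary-based representation and the default values chosen for u and t are assumptions made for illustration (only the discount default of 0.5 comes from the text), and t is assumed to be non-negative. It takes label sequences that already include the entry and exit labels.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def backoff_bigrams(seqs: Sequence[Sequence[str]], u: float = 1.0,
                    t: float = 1.0, discount: float = 0.5
                    ) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Return unigram p(i) and backed-off bigram p(i,j) as dictionaries."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))          # N(i,j) for adjacent pairs
    labels = sorted({lab for seq in seqs for lab in seq})
    n_i = {i: sum(counts[(i, j)] for j in labels) for i in labels}  # N(i)
    total = sum(max(n_i[i], u) for i in labels)                     # N
    # p(i) = N(i)/N if N(i) > u, else u/N
    p_uni = {i: max(n_i[i], u) / total for i in labels}
    p_bi: Dict[str, Dict[str, float]] = {}
    for i in labels:
        seen = [j for j in labels if counts[(i, j)] > t]            # the set B
        kept = {j: (counts[(i, j)] - discount) / n_i[i] for j in seen}
        denom = 1.0 - sum(p_uni[j] for j in seen)
        # Back-off weight b(i); guarded for the case where B covers every label.
        b_i = (1.0 - sum(kept.values())) / denom if denom > 0 else 0.0
        p_bi[i] = {j: kept.get(j, b_i * p_uni[j]) for j in labels}
    return p_uni, p_bi
```

With this choice of b(i), each row of the returned bigram table sums to one whenever at least one label falls outside B, which is the normalisation condition stated above.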

The formats of matrix and ARPA/MIT-LL format bigram files are described in Chapter 12. 
17.14.3 Use

HLStats is invoked by the command line 

HLStats [options] hmmList labFiles .... 

The hmmList should contain a list of all the labels (i.e. model names) used in the following label files
for which statistics are required. Any labels not appearing in the list are ignored and assumed to be
out-of-vocabulary. The list of labels is specified in the same way as for a HMM list (see HModel)
and the logical to physical mapping is preserved to allow statistics to be collected about physical
names as well as logical ones. The labFiles may be master label files. The available options are




-b fn Compute bigram statistics and store result in the file fn.

-c N Count the number of occurrences of each logical model listed in the hmmList and on completion
list all models for which there are N or less occurrences.

-d Compute minimum, maximum and average duration statistics for each label.

-f f Set the matrix bigram floor probability to f (default value 0.0). This option should be used
in conjunction with the -b option.

-h N Set the bigram hashtable size to medium (N=1) or large (N=2). This option should be used
in conjunction with the -b option. The default is small (N=0).

-l fn Output a list of covering labels to file fn. Only labels occurring in the labList are counted
(others are assumed to be out-of-vocabulary). However, this list may contain labels that do
not occur in any of the label files. The list of labels written to fn will however contain only
those labels which occur at least once.

-o Produce backed-off bigrams rather than matrix ones. These are output in the standard
ARPA/MIT-LL textual format.

-p N Count the number of occurrences of each physical model listed in the hmmList and on
completion list all models for which there are N or less occurrences.

-s st en Set the sentence start and end labels to st and en (default !ENTER and !EXIT).

-t n Set the threshold count for including a bigram in a back-off bigram language model. This
option should be used in conjunction with the -b and -o options.

-u f Set the unigram floor probability to f when constructing a back-off bigram language model.
This option should be used in conjunction with the -b and -o options.

-G fmt Set the label file format to fmt.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

HLStats also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.

17.14.4 Tracing

HLStats supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting. 

00002 trace memory usage. 

00004 show basic statistics whilst calculating bigrams. This includes the global training data 
entropy and the entropy for each distinct label. 

00010 show file names as they are processed. 

Trace flags are set using the -T option or the TRACE configuration variable. 
17.15 HMMIRest

17.15.1 Function 

This program is used to perform a single re-estimation of a HMM set using a discriminative training 
criterion. Maximum Mutual Information (MMI), Minimum Phone Error (MPE) and Minimum 
Word Error (MWE) are supported. Gaussian parameters are re-estimated using the Baum-Welch
update. 

Discriminative training needs to be initialised with trained models; these will generally be trained 
using MLE. Also essential are lattices. Two sets of lattices are needed: a lattice for the correct 
transcription of each training file, and a lattice derived from the recognition of each training file.
These are called the "numerator" and "denominator" lattices respectively, a reference to their
respective positions in the MMI objective function, which can be expressed as follows:

$$\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_{r} \log \frac{P^{\kappa}(O_r \mid \mathcal{M}_r^{\mathrm{num}})}{P^{\kappa}(O_r \mid \mathcal{M}_r^{\mathrm{den}})} \qquad (17.4)$$

where $\lambda$ is the HMM parameters, $\kappa$ is a scale on the likelihoods, $O_r$ is the speech data for the r'th
training file and $\mathcal{M}_r^{\mathrm{num}}$ and $\mathcal{M}_r^{\mathrm{den}}$ are the numerator and denominator models.

The numerator and denominator lattices need to contain phone alignment information (the start
and end times of each phone) and language model likelihoods. Phone alignment information can be
obtained by using HDecode with the letter d included in the lattice format options (specified to
HDecode by the -q command line option). It is important that the language model applied to the
lattice not be too informative (e.g., use a unigram LM) and that the pruning beam used to create
it be large enough (probably greater than 100). Typically an initial set of lattices would be created
by recognition using a bigram language model and a pruning beam in the region of 125-200 (e.g.
using HDecode), and this would then have a unigram language model applied to it and a pruning
beam applied in the region 100-150, using the tool HLRescore. The phone alignment information
can then be added using HDecode.

Having created these lattices using an initial set of models, the tool HMMIRest can be used
to re-estimate the HMMs for typically 4-8 iterations, using the same set of lattices.

HMMIRest supports multiple mixture Gaussians, tied-mixture HMMs, multiple data streams,
parameter tying within and between models (but not if means and variances are tied independently),
and diagonal covariance matrices only.

Like HERest, HMMIRest includes features to allow parallel operation where a network of
processors is available. When the training set is large, it can be split into separate chunks that
are processed in parallel on multiple machines/processors, speeding up the training process.
HMMIRest can operate in two modes:

1. Single-pass operation: the data files on the command line are the speech files to be used
for re-estimation of the HMM set; the re-estimated HMM set is written by the program to the
directory specified by the -M option.

2. Two-pass operation with data split into P blocks:

(a) First pass: multiple jobs are started with the options -p 1, -p 2, ..., -p P; the
different subsets of training files should be given on each command line or in a file specified
by the -S option. Accumulator files <dir>/HDR<p>.acc.1, <dir>/HDR<p>.acc.2 and
(for MPE) <dir>/HDR<p>.acc.3 are written, where <dir> is the directory specified by
the -M option and <p> is the number specified by the -p option.

(b) Second pass: HMMIRest is called with -p 0, and the command-line arguments are the
accumulator files written by the first pass. The re-estimated HMM set is written to the
directory specified by the -M option.

17.15.2 Use 

HMMIRest is invoked via the command line 

HMMIRest [options] hmmList trainFile ...



or alternatively if the training is in parallel mode, the second pass is invoked as 
HMMIRest [options] hmmList accFile ...

As always, the list of arguments can be stored in a script file (-S option) if required. 
Typical use 

To give an example of typical use, for single-pass operation the usage might be: 

HMMIRest -A -V -D -H <startdir>/MMF -M <enddir> -C <config> -q <num-lat-dir>
-r <den-lat-dir> -S <traindata-list-file> <hmmlist>

For two-pass operation, the first pass would be (for the p'th block of training files),

HMMIRest -p <p> -A -V -D -H <startdir>/MMF -M <enddir> -C <config> -q <num-lat-dir>
-r <den-lat-dir> -S <p'th-traindata-list-file> <hmmlist>



and the second pass would be

HMMIRest -p 0 -A -V -D -H <startdir>/MMF -M <enddir> -C <config> -q <num-lat-dir>
-r <den-lat-dir> <hmmlist> <enddir>/HDR1.acc.1 <enddir>/HDR1.acc.2
<enddir>/HDR2.acc.1 <enddir>/HDR2.acc.2 ... <enddir>/HDR<P>.acc.2

This is for MMI training; for MPE there will be three files HDRn.acc.1, HDRn.acc.2, HDRn.acc.3
per block of training data. ISMOOTHTAU can also be tuned.
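
The two-pass scheme can be scripted by generating the first-pass command lines for each block and then a single second-pass command that lists the accumulator files. The Python sketch below only builds argument lists and does not run anything; the function name, the script-file naming pattern and the `mpe` switch are hypothetical, while the options and the HDR<p>.acc.<k> file names follow the description above.

```python
from typing import List, Tuple


def two_pass_commands(num_blocks: int, start_mmf: str, out_dir: str,
                      config: str, num_lat_dir: str, den_lat_dir: str,
                      hmm_list: str, scp_pattern: str = "train.{p}.scp",
                      mpe: bool = False) -> Tuple[List[List[str]], List[str]]:
    """Return (first-pass commands, second-pass command) for HMMIRest."""
    common = ["-H", start_mmf, "-M", out_dir, "-C", config,
              "-q", num_lat_dir, "-r", den_lat_dir]
    # First pass: one job per block, each with its own script file (-S).
    first = [["HMMIRest", "-p", str(p), *common,
              "-S", scp_pattern.format(p=p), hmm_list]
             for p in range(1, num_blocks + 1)]
    # Two accumulator files per block for MMI, three for MPE.
    n_acc = 3 if mpe else 2
    accs = [f"{out_dir}/HDR{p}.acc.{k}"
            for p in range(1, num_blocks + 1)
            for k in range(1, n_acc + 1)]
    # Second pass: -p 0 with the accumulator files as arguments.
    second = ["HMMIRest", "-p", "0", *common, hmm_list, *accs]
    return first, second
```

The first-pass jobs are independent of each other and can be dispatched to separate machines; the second pass must wait until all accumulator files have been written.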



Typical config variables 



The most important config variables are given as follows for a number of example setups. MMI
training is the default behaviour.

1. I-smoothed MMI training: set ISMOOTHTAU=100

2. MPE training ("approximate-accuracy", which is the default kind of MPE): set MPE=TRUE, ISMOOTHTAU=50

3. "Exact" MPE training: set MPE=TRUE, EXACTCORRECTNESS=TRUE, INSCORRECTNESS=-0.9 (this
may have to be tuned in the range -0.8 to -1; it affects the testing insertion/deletion ratio)

4. "Approximate-error" MPE training (recommended): set MPE=TRUE, CALCASERROR=TRUE, INSCORRECTNESS=-0.
(this may have to be tuned in the range -0.8 to -1; it affects the testing insertion/deletion
ratio)

5. MWE training: set MWE=TRUE, ISMOOTHTAU=25

In all cases set LATPROBSCALE=<f>, where <f> will normally be in the range 1/10 to 1/20,
typically the inverse of the language model scale.

Storage of lattices

If there are a great many training files, the directories specified for the numerator and denominator
lattices can contain subdirectories containing the actual lattices. The name of the subdirectory
required can be extracted from the filenames by adding config variables, for instance as follows:

LATMASKNUM = */%%%?????.???
LATMASKDEN = */%%%?????.???

which would convert a filename foo/bar12345.lat to <dir>/bar/bar12345.lat, where <dir>
would be the directory specified by the -q or -r option. The %'s are converted to the directory
name and other characters are discarded. If lattices are gzipped in order to save space, they can be
read in by adding the config



HNETFILTER='gunzip -c < $.gz' 
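
The effect of the LATMASKNUM and LATMASKDEN masks described above can be imitated with a regular expression. The Python sketch below is an approximation written for illustration, not HTK's own matching code: it assumes that `*` matches any string, `?` matches any single character and `%` matches a single character that is kept, and the function names are hypothetical.

```python
import posixpath
import re
from typing import Optional


def apply_mask(mask: str, filename: str) -> Optional[str]:
    """Return the characters matched by '%' in the mask, or None if no match."""
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")       # any string
        elif ch == "?":
            parts.append(".")        # any single character, discarded
        elif ch == "%":
            parts.append("(.)")      # any single character, kept
        else:
            parts.append(re.escape(ch))
    match = re.fullmatch("".join(parts), filename)
    return None if match is None else "".join(match.groups())


def lattice_path(lat_dir: str, mask: str, filename: str) -> str:
    """Build <dir>/<subdir>/<basename> using the subdirectory from the mask."""
    subdir = apply_mask(mask, filename)
    base = posixpath.basename(filename)
    if subdir is None:
        return posixpath.join(lat_dir, base)   # fall back to a flat directory
    return posixpath.join(lat_dir, subdir, base)
```

With the mask `*/%%%?????.???`, the file name `foo/bar12345.lat` yields the subdirectory `bar`, so a lattice directory `num` gives `num/bar/bar12345.lat`, in line with the example in the text.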




17.15.3 Command-line options 

The full list of the options accepted by HMMIRest is as follows. Options not ever needed for 
normal operation are marked "obscure" . 

-d dir (obscure) Normally HMMIRest looks for HMM definitions (not already loaded via MMF 
files) in the current directory. This option tells HMMIRest to look in the directory dir to 
find them. 

-g (obscure) Maximum Likelihood (ML) mode. Maximum Likelihood lattice-based estimation
using only the numerator (correct-sentence) lattices.

-l (obscure) Maximum number of sentences to use (useful only for troubleshooting).




-o ext (obscure) This causes the file name extensions of the original models (if any) to be replaced
by ext.

-p N Set parallel mode. 1, 2, ..., N for a forward-backward pass on data split into N blocks. 0 for
re-estimating using accumulator files that this will write. Default (-1) is single-pass operation.

-q dir Directory to look for numerator lattice files. These files will have a filename the same as
the speech file, but with extension lat (or as given by -Q option).

-qp s Path pattern for the extended path to look for numerator lattice files. The matched string
will be spliced to the end of the directory string specified by option '-q' for a deeper path.

-r dir Directory to look for denominator lattice files.

-rp s Path pattern for the extended path to look for denominator lattice files. The matched string
will be spliced to the end of the directory string specified by option '-r' for a deeper path.

-s file File to write HMM statistics. 

-twodatafiles (obscure) Expect two of each data file for single pass retraining. This works as for
HERest; command line contains alternating alignment and update files.

-u flags By default, HMMIRest updates all of the HMM parameters, that is, means, variances,
mixture weights and transition probabilities. This option causes just the parameters indicated
by the flags argument to be updated; this argument is a string containing one or more of
the letters m (mean), v (variance), t (transition) and w (mixture weight). The presence of a
letter enables the updating of the corresponding parameter set.

-umle flags (obscure) A format as in -u; directs that the specified parameters be updated using
the ML criterion (for MMI operation only, not MPE/MWE).

-w floor (obscure) Set the mixture weight floor as a multiple of MINMIX=1.0e-5. Default is 2.

-x ext By default, HMMIRest expects a HMM definition for the model X to be stored in a file
called X. This option causes HMMIRest to look for the HMM definition in the file X.ext.

-B Output HMM definition files in binary format. 

-F fmt Set the source data format to fmt.

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs.

-Hprior MMF Load HMM macro file MMF, to be used as prior model in adaptation. Use with config
variables PRIORTAU etc.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-M dir Store output HMM macro model files in the directory dir. If this option is not given, the 
new HMM definition will overwrite the existing one. 

-Q ext Set the lattice file extension to ext 

HMMIRest also supports the standard options -A, -C, -D, -S, -T, and -V as described in 
section 4.4. 
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17.15.4 Tracing 

The command-line trace option can only be set to 0 (trace off) or 1 (trace on), which is the default. 
Tracing behaviour can be altered by the TRACE configuration variables in the modules HArc and 
HFBLat. 







17.16 HParse 

17.16.1 Function 

The HParse program generates word level lattice files (for use with e.g. HVite) from a text file
syntax description containing a set of rewrite rules based on extended Backus-Naur Form (EBNF).
The EBNF rules are used to generate an internal representation of the corresponding finite-state
network where HParse network nodes represent the words in the network, and are connected via
sets of links. This HParse network is then converted to an HTK V2 word level lattice. The program
provides one convenient way of defining such word level lattices.

HParse also provides a compatibility mode for use with HParse syntax descriptions used in
HTK V1.5 where the same format was used to define both the word level syntax and the dictionary.
In compatibility mode HParse will output the word level portion of such a syntax as an HTK V2
lattice file (via HNet) and the pronunciation information as an HTK V2 dictionary file (via HDict).

The lattice produced by HParse will often contain a number of !NULL nodes in order to reduce
the number of arcs in the lattice. The use of such !NULL nodes can both reduce size and increase
efficiency when used by recognition programs such as HVite.

17.16.2 Network Definition

The syntax rules for the textual definition of the network are as follows. Each node in the network
has a nodename. This node name will normally correspond to a word in the final syntax network.
Additionally, for use in compatibility mode, each node can also have an external name.

name = char{char}

nodename = name [ "%" ( "%" | name ) ]

Here char represents any character except one of the meta chars { } [ ] < > | = $ ( ) ; \ /
*. The latter may, however, be escaped using a backslash. The first name in a nodename represents
the name of the node ("internal name"), and the second optional name is the "external" name.
This is used only in compatibility mode, and is, by default, the same as the internal name.
Network definitions may also contain variables 

variable = $name 

Variables are identified by a leading $ character. They stand for sub-networks and must be defined 
before they appear in the RHS of a rule using the form 

subnet = variable "=" expr ";" 

An expr consists of a set of alternative sequences representing parallel branches of the network.

expr = sequence {"|" sequence}

sequence = factor{factor}

Each sequence is composed of a sequence of factors where a factor is either a node name, a variable
representing some sub-network or an expression contained within various sorts of brackets.



factor = "(" expr ")" |
         "{" expr "}" |
         "<" expr ">" |
         "[" expr "]" |
         "<<" expr ">>" |
         nodename |
         variable



Ordinary parentheses are used to bracket sub-expressions, curly braces { } denote zero or more 
repetitions and angle brackets <> denote one or more repetitions. Square brackets [] are used to 
enclose optional items. The double angle brackets are a special feature included for building context 
dependent loops and are explained further below. Finally, the complete network is defined by a list 
of sub-network definitions followed by a single expression within parentheses. 



network = {subnet} "(" expr ")"
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Note that C style comments may be placed anywhere in the text of the network definition. 
As an example, the following network defines a syntax for some simple edit commands 

$dir = up | down | left | right;
$mvcmd = move $dir | top | bottom;
$item = char | word | line | page;
$dlcmd = delete [$item]; /* default is char */
$incmd = insert;
$encmd = end [insert];
$cmd = $mvcmd|$dlcmd|$incmd|$encmd;
({sil} < $cmd {sil} > quit)
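The bracket semantics can be checked informally by translating the edit-command example by hand into a regular expression over space-separated words. The sketch below is not how HParse works internally; it assumes the final expression of the example reads ({sil} < $cmd {sil} > quit), and the names DIR, ITEM, CMD and accepts are introduced here purely for illustration.

```python
import re

# Hand translation of the example grammar (an illustration, not HParse itself):
#   {x} -> zero or more, <x> -> one or more, [x] -> optional.
DIR = r"(?:up|down|left|right)"            # $dir
ITEM = r"(?:char|word|line|page)"          # $item
CMD = (
    rf"(?:move {DIR}|top|bottom"           # $mvcmd
    rf"|delete(?: {ITEM})?"                # $dlcmd, item is optional
    r"|insert"                             # $incmd
    r"|end(?: insert)?)"                   # $encmd
)
# ({sil} < $cmd {sil} > quit)
SENTENCE = re.compile(rf"(?:sil )*(?:{CMD} (?:sil )*)+quit")


def accepts(words):
    """Return True if the word sequence is a sentence of the example grammar."""
    return SENTENCE.fullmatch(" ".join(words)) is not None


if __name__ == "__main__":
    print(accepts("sil move up delete word quit".split()))
    print(accepts(["quit"]))  # rejected: at least one command is required
```

A regular expression suffices here because the grammar has no recursion; every HParse network is finite-state, so the same approach works for other small examples.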

Double angle brackets are used to construct contextually consistent context-dependent loops
such as a word-pair grammar[10]. This function can also be used to generate consistent triphone
loops for phone recognition[11]. The entry and exit conditions to a context-dependent loop can be
controlled by the invisible pseudo-words TLOOP_BEGIN and TLOOP_END. The right context of
TLOOP_BEGIN defines the legal loop start nodes, and the left context of TLOOP_END defines the
legal loop finishers. If TLOOP_BEGIN/TLOOP_END are not present then all models are connected
to the entry/exit of the loop.

A word-pair grammar simply defines the legal set of words that can follow each word in the
vocabulary. To generate a network to represent such a grammar a right context-dependent loop
could be used. The legal set of sentence start and end words are defined as above using
TLOOP_BEGIN/TLOOP_END.

For example, the following lists the legal followers for each word in a 7 word vocabulary

ENTRY     - show, tell, give
show      - me, all
tell      - me, all
me        - all
all       - names, addresses
names     - and, names, addresses, show, tell, EXIT
addresses - and, names, addresses, show, tell, EXIT
and       - names, addresses, show, tell
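To make the meaning of the follower table concrete, the following sketch stores it as a dictionary and tests whether a word sequence is legal under the word-pair grammar. It is an independent illustration rather than anything HParse does; the names FOLLOWERS and is_legal are hypothetical.

```python
# Follower table from the text; ENTRY and EXIT mark sentence start and end.
_AFTER_NOUN = {"and", "names", "addresses", "show", "tell", "EXIT"}
FOLLOWERS = {
    "ENTRY": {"show", "tell", "give"},
    "show": {"me", "all"},
    "tell": {"me", "all"},
    "me": {"all"},
    "all": {"names", "addresses"},
    "names": _AFTER_NOUN,
    "addresses": _AFTER_NOUN,
    "and": {"names", "addresses", "show", "tell"},
}


def is_legal(words):
    """True if every adjacent word pair (including ENTRY/EXIT) is permitted."""
    path = ["ENTRY", *words, "EXIT"]
    return all(b in FOLLOWERS.get(a, set()) for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    print(is_legal("show me all names".split()))
    print(is_legal("show names".split()))
```

Note that the table lists no followers for give, so under this reading any sentence starting with give is rejected.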

HParse can generate a suitable lattice to represent this word-pair grammar by using the
following specification:

$TLOOP_BEGIN_FLLWRS = show|tell|give;
$TLOOP_END_PREDS = names|addresses;
$show_FLLWRS = me|all;
$tell_FLLWRS = me|all;
$me_FLLWRS = all;
$all_FLLWRS = names|addresses;
$names_FLLWRS = and|names|addresses|show|tell|TLOOP_END;
$addresses_FLLWRS = and|names|addresses|show|tell|TLOOP_END;
$and_FLLWRS = names|addresses|show|tell;

( sil <<
    TLOOP_BEGIN+TLOOP_BEGIN_FLLWRS |
    TLOOP_END_PREDS-TLOOP_END |
    show+show_FLLWRS |
    tell+tell_FLLWRS |
    me+me_FLLWRS |
    all+all_FLLWRS |
    names+names_FLLWRS |
    addresses+addresses_FLLWRS |
    and+and_FLLWRS
>> sil )



[10] The expression between double angle brackets must be a simple list of alternative node names or a variable which
has such a list as its value.

[11] In HTK V2 it is preferable for these context-loop expansions to be done automatically via HNet, to avoid
requiring a dictionary entry for every context-dependent model.
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where it is assumed that each utterance begins and ends with sil model. 

In this example, each set of contexts is defined by creating a variable whose alternatives are the
individual contexts. The actual context-dependent loop is indicated by the << >> brackets. Each
element in this loop is a single variable name of the form A-B+C where A represents the left context,
C represents the right context and B is the actual word. Each of A, B and C can be nodenames or
variable names but note that this is the only case where variable names are expanded automatically
and the usual $ symbol is not used[12]. Both A and C are optional, and left and right contexts can
be mixed in the same triphone loop.



17.16.3 Compatibility Mode

In HParse compatibility mode, the interpretation of the EBNF network is that used by the HTK
V1.5 HVite program, in which HParse EBNF notation was used to define both the word level
syntax and the dictionary. Compatibility mode is aimed at converting files written for HTK V1.5
into their equivalent HTK V2 representation. Therefore HParse will output the word level portion
of such an EBNF syntax as an HTK V2 lattice file and the pronunciation information is optionally
stored in an HTK V2 dictionary file. When operating in compatibility mode and not generating
dictionary output, the pronunciation information is discarded.

In compatibility mode, the reserved node names WD_BEGIN and WD_END are used to delimit word
boundaries; nodes between a WD_BEGIN/WD_END pair are called "word-internal" while all other nodes
are "word-external". All WD_BEGIN/WD_END nodes must have an "external name" attached that
denotes the word. It is a requirement that the number of WD_BEGIN and the number of WD_END
nodes are equal and furthermore that there isn't a direct connection from a WD_BEGIN node to a
WD_END. For example a portion of such an HTK V1.5 network could be

$A       = WD_BEGIN%A ax WD_END%A;
$ABDOMEN = WD_BEGIN%ABDOMEN ae b d ax m ax n WD_END%ABDOMEN;
$ABIDES  = WD_BEGIN%ABIDES ax b ay d z WD_END%ABIDES;
$ABOLISH = WD_BEGIN%ABOLISH ax b aa l ih sh WD_END%ABOLISH;
... etc

( <
$A | $ABDOMEN | $ABIDES | $ABOLISH | ... etc
> )



HParse will output the connectivity of the words in an HTK V2 word lattice format file and the
pronunciation information in an HTK V2 dictionary. Word-external nodes are treated as words
and stored in the lattice with corresponding entries in the dictionary.

It should be noted that in HTK V1.5 any EBNF network could appear between a WD_BEGIN/WD_END
pair, which includes loops. Care should therefore be taken with syntaxes that define very complex
sets of alternative pronunciations. It should also be noted that each dictionary entry is limited in
length to 100 phones. If multiple instances of the same word are found in the expanded HParse
network, a dictionary entry will be created for only the first instance and subsequent instances are
ignored (a warning is printed). If words with a NULL external name are present then the dictionary
will contain a NULL output symbol.

Finally, since the implementation of the generation of the HParse network has been revised[13]
the semantics of variable definition and use has been slightly changed. Previously variables could
be redefined during network definition and each use would follow the most recent definition. In
HTK V2 only the final definition of any variable is used in network expansion.



17.16.4 Use 

HParse is invoked via the command line 
HParse [options] syntaxFile latFile 

[12] If the base-names or left/right context of the context-dependent names in a context-dependent loop are variables,
no $ symbols are used when writing the context-dependent nodename.

[13] With the added benefit of rectifying some residual bugs in the HTK V1.5 implementation.
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HParse will then read the set of ENBF rules in syntaxFile and produce the output lattice in 
latFile. 

The detailed operation of HParse is controlled by the following command line options 

-b Output the lattice in binary format. This increases speed of subsequent loading (default
ASCII text lattices).

-c Set V1.5 compatibility mode. Compatibility mode can also be set by using the configuration
variable V1COMPAT (default compatibility mode disabled).

-d s Output dictionary to file s. This is only a valid option when operating in compatibility mode.
If not set no dictionary will be produced.

-l Include language model log probabilities in the output. These log probabilities are calculated
as -log(number of followers) for each network node.

HParse also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.
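As a rough illustration of the log probabilities attached to each node, the sketch below evaluates -log(number of followers) for a small follower table. It assumes natural logarithms, which the text does not state, and the function name follower_logprobs is hypothetical.

```python
import math


def follower_logprobs(followers):
    """Map each node to -log(number of followers); nodes with none are skipped.

    Assumption: natural logarithm. The base used by HParse is not given here.
    """
    return {node: -math.log(len(nxt)) for node, nxt in followers.items() if nxt}


if __name__ == "__main__":
    table = {"show": ["me", "all"], "me": ["all"]}
    for node, lp in follower_logprobs(table).items():
        print(node, round(lp, 4))
```

A node with a single follower therefore receives a log probability of zero, and the value becomes more negative as the number of followers grows.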



17.16.5 Tracing

HParse supports the following trace options where each trace flag is given using an octal base
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17.17 HQuant 

17.17.1 Function 

This program will construct a HTK format VQ table consisting of one or more codebooks, each 
corresponding to an independent data stream. A codebook is a collection of indexed reference 
vectors that are intended to represent the structure of the training data. Ideally, every compact 
cluster of points in the training data is represented by one reference vector. VQ tables are used 
directly when building systems based on Discrete Probability HMMs. In this case, the
continuous-valued speech vectors in each stream are replaced by the index of the closest reference vector in
each corresponding codebook. Codebooks can also be used to attach VQ indices to continuous
observations. A typical use of this is to preselect Gaussians in probability computations. More
information on the use of VQ tables is given in section 5.14.

Codebook construction consists of finding clusters in the training data, assigning a unique
reference vector (the cluster centroid) to each, and storing the resultant reference vectors and
associated codebook data in a VQ table. HQuant uses a top-down clustering process whereby
clusters are iteratively split until the desired number of clusters are found. The optimal codebook
size (number of reference vectors) depends on the structure and amount of the training data, but a
value of 256 is commonly used.

HQuant can construct both linear (i.e. flat) and tree-structured (i.e. binary) codebooks. Linear
codebooks can have lower quantisation errors but tree-structured codebooks have log2 N access
times compared to N for the linear case. The distance metric can either be Euclidean, diagonal
covariance Mahalanobis or full covariance Mahalanobis.

17.17.2 VQ Codebook Format

Externally, a VQ table is stored in a text file consisting of a header followed by a sequence of entries
representing each tree node. One tree is built for each stream and linear codebooks are represented
by a tree in which there are only right branches.

The header consists of a magic number followed by the covariance kind, the number of following
nodes, the number of streams and the width of each stream.

header = magic type covkind numNodes numS swidth1 swidth2 . . .

where magic is a magic number which is usually the code for the parameter kind of the data. The
type defines the type of codebook

type = linear (0), binary tree-structured (1)

The covariance kind determines the type of distance metric to be used

covkind = diagonal covariance (1), full covariance (2), euclidean (5)

Within the file, these covariances are stored in inverse form.
Each node entry has the following form

node-entry = stream vqidx nodeId leftId rightId
             mean-vector
             [inverse-covariance-matrix | inverse-variance-vector]

Stream is the stream index for this entry. Vqidx is the VQ index corresponding to this entry. This is
the number that appears in vector quantised speech files. In tree-structured codebooks, it is zero
for non-terminal nodes. Every node has a unique integer identifier (distinct from the VQ index)
given by nodeId. The left and right daughter of the current node are given by leftId and rightId. In
a linear codebook, the left identifier is always zero.
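The header layout can be illustrated with a small parser. This is a sketch under the assumption that the header fields are whitespace-separated integers in the order magic, type, covkind, numNodes, number of streams, then one width per stream; the class VQHeader, the function parse_vq_header and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VQHeader:
    magic: int          # usually the parameter kind code of the data
    type: int           # 0 = linear, 1 = binary tree-structured
    covkind: int        # 1 = diagonal, 2 = full, 5 = euclidean
    num_nodes: int
    num_streams: int
    stream_widths: list


def parse_vq_header(text):
    """Parse a header given as whitespace-separated integers (assumed layout)."""
    fields = [int(tok) for tok in text.split()]
    magic, cb_type, covkind, num_nodes, num_streams = fields[:5]
    widths = fields[5:5 + num_streams]
    if len(widths) != num_streams:
        raise ValueError("header lists fewer stream widths than streams")
    return VQHeader(magic, cb_type, covkind, num_nodes, num_streams, widths)


if __name__ == "__main__":
    # Illustrative values only: a linear, Euclidean, single-stream codebook.
    print(parse_vq_header("9 0 5 4 1 12"))
```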

Some examples of VQ tables are given in Chapter 11. 

17.17.3 Use 

HQuant is invoked via the command line 



HQuant [options] vqFile trainFiles 
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where vqFile is the name of the output VQ table file. The effect of this command is to read in 
training data from each trainFile, cluster the data and write the final cluster centres into the VQ 
table file. 

The list of training files can be stored in a script file if required. Furthermore, the data used for 
training the codebook can be limited to that corresponding to a specified label. This can be used, 
for example, to train phone specific codebooks. When constructing a linear codebook, the maximum 
number of iterations per cluster can be limited by setting the configuration variable MAXCLUSTITER. 
The minimum number of samples in any one cluster can be set using the configuration variable 
MINCLUSTSIZE. 

The detailed operation of HQuant is controlled by the following command line options

-d Use a diagonal-covariance Mahalanobis distance metric for clustering (default is to use a
Euclidean distance metric).

-f Use a full-covariance Mahalanobis distance metric for clustering (default is to use a Euclidean
distance metric).

-g Output the global covariance to a codebook. Normally, covariances are computed individually
for each cluster using the data in that cluster. This option computes a global covariance across
all the clusters.

-l s The string s must be the name of a segment label. When this option is used, HQuant
searches through all of the training files and uses only the speech frames from segments with
the given label. When this option is not used, HQuant uses all of the data in each training
file.

-n S N Set size of codebook for stream S to N (default 256). If tree-structured codebooks are
required then N must be a power of 2.



-s N Set number of streams to N (default 1). Unless the -w option is used, the width of each stream
is set automatically depending on the size and parameter kind of the training data.

-t Create tree-structured codebooks (default linear).

-w S N Set width of stream S to N. This option overrides the default decomposition that HTK
normally uses to divide a parameter file into streams. If this option is used, it must be
repeated for each individual stream.

-F fmt Set the source data format to fmt.

-G fmt Set the label file format to fmt.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-L dir Search directory dir for label files (default is to search current directory).

-X ext Set label file extension to ext (default is lab).

HQuant also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.

17.17.4 Tracing

HQuant supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting. 

00002 dump global mean and covariance 
00004 trace data loading. 

00010 list label segments. 
00020 dump clusters. 
00040 dump VQ table. 



Trace flags are set using the -T option or the TRACE configuration variable. 
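Because each trace flag occupies a distinct bit, several flags can be combined by a bitwise OR of their octal values. The sketch below only illustrates the arithmetic; the constant names are hypothetical and are not identifiers from the HTK sources.

```python
# Octal trace flag values from the list above (names are illustrative only).
T_PROGRESS = 0o00001   # basic progress reporting
T_CLUSTERS = 0o00020   # dump clusters
T_VQTABLE = 0o00040    # dump VQ table

# Request progress reporting together with a dump of the clusters.
combined = T_PROGRESS | T_CLUSTERS

if __name__ == "__main__":
    # Show the combined value in the same five-digit octal form as the list.
    print(format(combined, "05o"))
```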
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17.18 HRest 

17.18.1 Function 

HRest performs basic Baum-Welch re-estimation of the parameters of a single HMM using a set
of observation sequences. HRest can be used for normal isolated word training in which the
observation sequences are realisations of the corresponding vocabulary word.

Alternatively, HRest can be used to generate seed HMMs for phoneme-based recognition. In
this latter case, the observation sequences will consist of segments of continuously spoken training
material. HRest will cut these out of the training data automatically by simply giving it a segment
label.

In both of the above applications, HRest is intended to operate on HMMs with initial parameter
values estimated by HInit.

HRest supports multiple mixture components, multiple streams, parameter tying within a
single model, full or diagonal covariance matrices, tied-mixture models and discrete models. The
outputs of HRest are often further processed by HERest.

Like all re-estimation tools, HRest allows a floor to be set on each individual variance by
defining a variance floor macro for each data stream (see chapter 8). If any diagonal covariance
component falls below 0.00001, then the corresponding mixture weight is set to zero. A warning is
issued if the number of mixtures is greater than one, otherwise an error occurs. Applying a variance
floor via the -v option or a variance floor macro can be used to prevent this.

17.18.2 Use

HRest is invoked via the command line 

HRest [options] hmm trainFiles . . . 

This causes the parameters of the given hmm to be re-estimated repeatedly using the data in
trainFiles until either a maximum iteration limit is reached or the re-estimation converges. The
HMM definition can be contained within one or more macro files loaded via the standard -H option.
Otherwise, the definition will be read from a file called hmm. The list of train files can be stored in
a script file if required.

The detailed operation of HRest is controlled by the following command line options

-c f Set the threshold for tied-mixture observation pruning to f. When all mixtures of all models
are tied to create a full tied-mixture system, the calculation of output probabilities is treated as
a special case. Only those mixture component probabilities which fall within f of the maximum
mixture component probability are used in calculating the state output probabilities (default
10.0).

-e f This sets the convergence factor to the real value f. The convergence factor is the relative
change between successive values of P(O|λ) (default value 0.0001).

-i N This sets the maximum number of re-estimation cycles to N (default value 20).

-l s The string s must be the name of a segment label. When this option is used, HRest searches
through all of the training files and cuts out all segments with the given label. When this
option is not used, HRest assumes that each training file is a single token.

-m N Sets the minimum number of training examples to be N. If fewer than N examples are supplied
then an error is reported (default value 3).

-t Normally, training sequences are rejected if they have fewer frames than the number of
emitting states in the HMM. Setting this switch disables this reject mechanism. Using this
option only makes sense if the HMM has skip transitions.

-u flags By default, HRest updates all of the HMM parameters, that is, means, variances,
mixture weights and transition probabilities. This option causes just the parameters indicated
by the flags argument to be updated; this argument is a string containing one or more of
the letters m (mean), v (variance), t (transition) and w (mixture weight). The presence of a
letter enables the updating of the corresponding parameter set.



-v f This sets the minimum variance (i.e. diagonal element of the covariance matrix) to the real
value f. This is ignored if an explicit variance floor macro is defined. The default value is 0.0.

-w f Any mixture weight or discrete observation probability which falls below the global constant 
MINMIX is treated as being zero. When this parameter is set, all mixture weights are floored 
to f * MINMIX. 

-B Output HMM definition files in binary format. 

-F fmt Set the source data format to fmt. 

-G fmt Set the label file format to fmt.

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-L dir Search directory dir for label files (default is to search current directory).

-M dir Store output HMM macro model files in the directory dir. If this option is not given, the
new HMM definition will overwrite the existing one.

-X ext Set label file extension to ext (default is lab).

HRest also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.

17.18.3 Tracing

HRest supports the following trace options where each trace flag is given using an octal base

000001 basic progress reporting.
000002 output information on the training data loaded.
000004 the observation probabilities.
000010 the alpha matrices.
000020 the beta matrices.
000040 the occupation counters.
000100 the transition counters.
000200 the mean counters.
000400 the variance counters.
001000 the mixture weight counters.
002000 the re-estimated transition matrix.
004000 the re-estimated mixture weights.
010000 the re-estimated means.
020000 the re-estimated variances.

Trace flags are set using the -T option or the TRACE configuration variable. 



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17.19 HResults 
17.19.1 Function 

HResults is the HTK performance analysis tool. It reads in a set of label files (typically output 
from a recognition tool such as HVite) and compares them with the corresponding reference 
transcription files. For the analysis of speech recognition output, the comparison is based on a 
Dynamic Programming-based string alignment procedure. For the analysis of word-spotting output,
the comparison uses the standard US NIST FOM metric. 

When used to calculate the sentence accuracy using DP the basic output is recognition statistics
for the whole file set in the format

Overall Results

SENT: %Correct=13.00 [H=13, S=87, N=100]

WORD: %Corr=53.36, Acc=44.90 [H=460, D=49, S=353, I=73, N=862]

The first line gives the sentence-level accuracy based on the total number of label files which are
identical to the transcription files. The second line is the word accuracy based on the DP matches
between the label files and the transcriptions. In this second line, H is the number of correct
labels, D is the number of deletions, S is the number of substitutions, I is the number of insertions
and N is the total number of labels in the defining transcription files. The percentage number of
labels correctly recognised is given by

\%\mathrm{Correct} = \frac{H}{N} \times 100\% \qquad (17.5)

and the accuracy is computed by

\mathrm{Accuracy} = \frac{H - I}{N} \times 100\% \qquad (17.6)
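The two figures can be reproduced from the counts in the WORD line. The sketch below assumes the usual definitions, %Correct = H/N x 100 and Accuracy = (H - I)/N x 100; the function names are hypothetical.

```python
def percent_correct(H, N):
    """Percentage of reference labels correctly recognised."""
    return 100.0 * H / N


def accuracy(H, I, N):
    """Percentage accuracy, which also penalises insertions."""
    return 100.0 * (H - I) / N


if __name__ == "__main__":
    # Counts from the example WORD line: H=460, D=49, S=353, I=73, N=862.
    H, D, S, I, N = 460, 49, 353, 73, 862
    assert H + D + S == N  # every reference label is a hit, deletion or substitution
    print(f"%Corr={percent_correct(H, N):.2f}, Acc={accuracy(H, I, N):.2f}")
```

With these counts the sketch gives 53.36 and 44.90, matching the example output. Because insertions are subtracted, the accuracy can be negative when a recogniser inserts more labels than it gets right.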

In addition to the standard HTK output format, HResults provides an alternative similar to
that used in the US NIST scoring package, i.e.

|         | # Snt |  Corr    Sub    Del    Ins    Err  S. Err |
|---------|-------|-------------------------------------------|
| Sum/Avg |   87  | 53.36  40.95   5.68   8.47  55.10  87.00 |

When HResults is used to generate a confusion matrix, the values are as follows:

%c The percentage correct in the row; that is, how many times a phone instance was correctly
labelled.

%e The percentage of incorrectly labelled phones in the row as a percentage of the total number of
labels in the set.

An example from the HTKDemo routines: 

====================== HTK Results Analysis =======================
Date: Thu Jan 10 19:00:03 2002
Ref : labels/bcplabs/mon
Rec : test/te1.rec
      test/te2.rec
      test/te3.rec

Overall Results

SENT: %Correct=0.00 [H=0, S=3, N=3]



The choice of "Sentence" and "Word" here is the usual case but is otherwise arbitrary. HResults just compares 
label sequences. The sequences could be paragraphs, sentences, phrases or words, and the labels could be phrases, 
words, syllables or phones, etc. Options exist to change the output designations 'SENT' and 'WORD' to whatever 
is appropriate. 
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WORD: y.Corr=63.91, Acc=59.40 [H=85, D=35, S=13, 1=6, N=133] 

Confusion Matrix

       S    C    V    N    L  Del  [ %c / %e]
  S    6    1    0    1    0    0  [75.0/1.5]
  C    2   35    3    1    0   18  [85.4/4.5]
  V    0    1   28    0    1   12  [93.3/1.5]
  N    0    1    0    7    0    1  [87.5/0.8]
  L    0    1    1    0    9    4  [81.8/1.5]
Ins    2    2    0    2    0

Reading across the rows, %c indicates the number of correct instances divided by the total number
of instances in the row. %e is the number of incorrect instances in the row divided by the total
number of instances (N).
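The %c and %e columns of the example can be recomputed from the matrix entries. The sketch below is an illustration with hypothetical names; it assumes that the row totals exclude the Del column, which is the reading that reproduces the printed figures, and that N is the sum of all matrix entries and deletions (133 here).

```python
LABELS = ["S", "C", "V", "N", "L"]
# Rows are reference labels, columns are recognised labels.
MATRIX = [
    [6, 1, 0, 1, 0],
    [2, 35, 3, 1, 0],
    [0, 1, 28, 0, 1],
    [0, 1, 0, 7, 0],
    [0, 1, 1, 0, 9],
]
DELETIONS = [0, 18, 12, 1, 4]
# Total number of reference labels: hits + substitutions + deletions.
N = sum(sum(row) for row in MATRIX) + sum(DELETIONS)


def row_stats(i):
    """Return (%c, %e) for row i, rounded to one decimal place."""
    row = MATRIX[i]
    total = sum(row)                     # deletions are not counted in the row
    correct = row[i]
    pc = 100.0 * correct / total
    pe = 100.0 * (total - correct) / N
    return round(pc, 1), round(pe, 1)


if __name__ == "__main__":
    for i, label in enumerate(LABELS):
        print(label, row_stats(i))
```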

Optional extra outputs available from HResults are

• recognition statistics on a per file basis

• recognition statistics on a per speaker basis

• recognition statistics from best of N alternatives

• time-aligned transcriptions

• confusion matrices

For comparison purposes, it is also possible to assign two labels to the same equivalence class (see
-e option). Also, the null label ??? is defined so that making any label equivalent to the null
label means that it will be ignored in the matching process. Note that the order of equivalence
labels is important; to ensure that label X is ignored, the command line option -e ??? X would be
used. Label files containing triphone labels of the form A-B+C can be optionally stripped down to
just the class name B via the -s switch.
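The effect of stripping a context-dependent label down to its base name can be sketched as follows. This is an illustration of the A-B+C convention only; the function name strip_context is hypothetical and HResults may treat unusual labels differently.

```python
def strip_context(label):
    """Reduce a label of the form A-B+C (A and C optional) to the base name B."""
    base = label.split("-", 1)[-1]   # drop the left context, if any
    return base.split("+", 1)[0]     # drop the right context, if any


if __name__ == "__main__":
    for lab in ["s-ih+t", "ih+t", "s-ih", "sil"]:
        print(lab, "->", strip_context(lab))
```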

The word spotting mode of scoring can be used to calculate hits, false alarms and the associated
figure of merit for each of a set of keywords. Optionally it can also calculate ROC information over
a range of false alarm rates. A typical output is as follows

Figures of Merit

Keyword   #Hits   #FAs   #Actual     FOM
A             8      1        14   30.54
B             4      2        14   15.27
Overall      12      3        28   22.91

which shows the number of hits and false alarms (FA) for two keywords A and B. A label in the
test file with start time ts and end time te constitutes a hit if there is a corresponding label in the
reference file such that ts < tm < te where tm is the mid-point of the reference label.
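The hit criterion can be written as a one-line test. The sketch below is illustrative only: it assumes the test and reference labels have already been matched by keyword, and the function name is_hit and its argument names are hypothetical.

```python
def is_hit(test_start, test_end, ref_start, ref_end):
    """True if the mid-point of the reference label lies strictly inside the test label."""
    ref_mid = (ref_start + ref_end) / 2.0
    return test_start < ref_mid < test_end


if __name__ == "__main__":
    print(is_hit(1.0, 2.0, 1.2, 1.6))  # reference mid-point 1.4 lies inside
    print(is_hit(1.0, 2.0, 1.8, 2.6))  # reference mid-point 2.2 lies outside
```

A putative keyword that overlaps the reference only at its edge is therefore counted as a false alarm rather than a hit.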

Note that for keyword scoring, the test transcriptions must include a score with each labelled
word spot and all transcriptions must include boundary time information.

The FOM gives the % of hits averaged over the range 1 to 10 FAs per hour. This is calculated
by first ordering all spots for a particular keyword according to the match score. Then for each
FA rate f, the number of hits are counted starting from the top of the ordered list and stopping
when f false alarms have been encountered. This corresponds to a posteriori setting of the keyword
detection threshold and effectively gives an upper bound on keyword spotting performance.



17.19.2 Use 

HResults is invoked by typing the command line 

HResults [options] hmmList recFiles . . .

This causes HResults to be applied to each recFile in turn. The hmmList should contain a list 
of all model names for which result information is required. Note, however, that since the context 
dependent parts of a label can be stripped, this list is not necessarily the same as the one used to 
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perform the actual recognition. For each recFile, a transcription file with the same name but the 
extension .lab (or some user specified extension - see the -X option) is read in and matched with it.
The recFiles may be master label files (MLFs), but note that even if such an MLF is loaded using
the -I option, the list of files to be checked still needs to be passed, either as individual command
line arguments or via a script with the -S option. For this reason, it is simpler to pass the recFile
MLF as one of the command line filename arguments. For loading reference label file MLFs, the
-I option must be used. The reference labels and the recognition labels must have different file
extensions. The available options are

-a s change the label SENT in the output to s.

-b s change the label WORD in the output to s.

-c when comparing labels convert to upper case. Note that case is still significant for equivalences
(see -e below).

-d N search the first N alternatives for each test label file to find the most accurate match with
the reference labels. Output results will be based on the most accurate match to allow NBest
error rates to be found.

-e s t the label t is made equivalent to the label s. More precisely, t is assigned to an equivalence
class of which s is the identifying member.

-f Normally, HResults accumulates statistics for all input files and just outputs a summary on
completion. This option forces match statistics to be output for each input test file.

-g fmt This sets the test label format to fmt. If this is not set, the recFiles should be in the
same format as the reference files.

-h Output the results in the same format as US NIST scoring software.

-k s Collect and output results on a speaker by speaker basis (as well as globally). s defines a
pattern which is used to extract the speaker identifier from the test label file name. In addition
to the pattern matching metacharacters * and ? (which match zero or more characters and
a single character respectively), the character % matches any character whilst including it as
part of the speaker identifier.

-m N Terminate after collecting statistics from the first N files.

-n Set US NIST scoring software compatibility.

-p This option causes a phoneme confusion matrix to be output.

-s This option causes all phoneme labels with the form A-B+C to be converted to B. It is useful
for analysing the results of phone recognisers using context dependent models.

-t This option causes a time-aligned transcription of each test file to be output provided that it
differs from the reference transcription file.

-u f Changes the time unit for calculating false alarm rates (for word spotting scoring) to f hours
(default is 1.0).

-w Perform word spotting analysis rather than string accuracy calculation.

-z s This redefines the null class name to s. The default null class name is ???, which may be
difficult to manage in shell script programming.

-G fmt Set the label file format to fmt. 

-I mlf This loads the master label file mlf . This option may be repeated to load several MLFs. 

-L dir Search directory dir for label files (default is to search current directory). 

-X ext Set label file extension to ext (default is lab). 



HResults also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
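
The label handling behind the -s and -k options can be illustrated with a short Python sketch. This is not HResults code: the function names strip_context and speaker_id are hypothetical, and the greedy regular-expression translation of the *, ? and % metacharacters is an assumption about how the pattern is matched.

```python
import re


def strip_context(label: str) -> str:
    """Reduce a context-dependent label of the form A-B+C to its base phone B."""
    base = label.split("-", 1)[-1]  # drop the left context "A-", if present
    return base.split("+", 1)[0]    # drop the right context "+C", if present


def speaker_id(pattern: str, filename: str):
    """Return the characters matched by '%' in pattern, or None if no match.

    '*' matches zero or more characters, '?' matches one character and
    '%' matches one character that becomes part of the speaker identifier.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        elif ch == "%":
            parts.append("(.)")
        else:
            parts.append(re.escape(ch))
    match = re.fullmatch("".join(parts), filename)
    return "".join(match.groups()) if match else None
```

With these assumptions, the pattern */%%%%_* applied to a test label file named data/spk1_utt3.rec would yield the speaker identifier spk1.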
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17.19.3 Tracing 



HResults supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting. 

00002 show error rate for each test alternative. 
00004 show speaker identifier matches. 

00010 warn about non-keywords found during word spotting. 



00020 show detailed word spotting scores.

00040 show memory usage.

Trace flags are set using the -T option or the TRACE configuration variable.
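
Because each trace flag is a single bit written in octal, several flags are combined by OR-ing them together. The following Python sketch shows the arithmetic only; the table name HRESULTS_TRACE and the helper functions are hypothetical, and the descriptions are copied from the list above.

```python
# Octal trace flags for HResults, as listed in this section.
HRESULTS_TRACE = {
    0o1: "basic progress reporting",
    0o2: "show error rate for each test alternative",
    0o4: "show speaker identifier matches",
    0o10: "warn about non-keywords found during word spotting",
    0o20: "show detailed word spotting scores",
    0o40: "show memory usage",
}


def decode_trace(value: str):
    """List the descriptions of all flags set in an octal trace string."""
    bits = int(value, 8)
    return [name for flag, name in sorted(HRESULTS_TRACE.items()) if bits & flag]


def encode_trace(flags):
    """Combine individual flag values into one five-digit octal string."""
    bits = 0
    for flag in flags:
        bits |= flag
    return format(bits, "05o")
```

For example, combining basic progress reporting (00001) with detailed word spotting scores (00020) gives the value 00021.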
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17.20 HSGen 

17.20.1 Function 

This program will read in a word network definition in standard HTK lattice format representing a
Regular Grammar G and randomly generate sentences from the language L(G) of G. The sentences
are written to standard output, one per line, and an option is provided to number them if required.
The empirical entropy $H_e$ can also be calculated using the formula

$$H_e = -\frac{\sum_k \log_2 P(S_k)}{\sum_k |S_k|} \qquad (17.7)$$

where $S_k$ is the k'th sentence generated and $|S_k|$ is its length. The perplexity $P_e$ is computed from
$H_e$ by

$$P_e = 2^{H_e} \qquad (17.8)$$

The probability of each sentence $P(S_k)$ is computed from the product of the individual branch
probabilities.
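
The entropy and perplexity calculation can be sketched in a few lines of Python. This is an illustration rather than HSGen code: the function names are hypothetical, and the entropy is assumed to be the negative total log2 probability of the generated sentences divided by their total length in words.

```python
import math


def sentence_probability(branch_probs):
    """Probability of one sentence: the product of its branch probabilities."""
    p = 1.0
    for b in branch_probs:
        p *= b
    return p


def empirical_entropy(sentences):
    """Empirical entropy in bits per word.

    sentences is a list of (probability, length) pairs, one per generated
    sentence; probabilities must be greater than zero.
    """
    total_log = sum(math.log2(p) for p, _ in sentences)
    total_len = sum(n for _, n in sentences)
    return -total_log / total_len


def perplexity(entropy):
    """Perplexity corresponding to an entropy given in bits."""
    return 2.0 ** entropy
```

Since the estimate is an average over sampled sentences, it becomes more reliable as the number of generated sentences grows.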

17.20.2 Use

HSGen is invoked by the command

HSGen [options] wdnet dictfile

where dictfile is a dictionary containing all of the words used in the word network stored in
wdnet. This dictionary is only used as a word list; the pronunciations are ignored.
The available options are

-l When this option is set, each generated sentence is preceded by a line number.

-n N This sets the total number of sentences generated to be N (default value 100).

-q Set quiet mode. This suppresses the printing of sentences. It is useful when estimating the
entropy of L(G) since the accuracy of the latter depends on the number of sentences generated.

-s Compute word network statistics. When set, the number of network nodes, the vocabulary
size, the empirical entropy, the perplexity, the average sentence length, the minimum sentence
length and the maximum sentence length are computed and printed on the standard output.

HSGen also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.

17.20.3 Tracing

HSGen supports the following trace options where each trace flag is given using an octal base

00001 basic progress reporting

00002 detailed trace of lattice traversal

Trace flags are set using the -T option or the TRACE configuration variable.
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17.21 HSLab 
17.21.1 Function 

HSLab is an interactive label editor for manipulating speech label files. An example of using
HSLab would be to load a sampled waveform file, determine the boundaries of the speech units of
interest and assign labels to them. Alternatively, an existing label file can be loaded and edited by
changing current label boundaries, deleting and creating new labels. HSLab is the only tool in the
HTK package which makes use of the graphics library HGraf.

When started HSLab displays a window which is split into two parts: a display section and
a control section (see Fig. 17.1). The display section contains the plotted speech waveform with
the associated labels. The control section consists of a palette of buttons which are used to invoke
the various facilities available in the tool. The buttons are laid out into three different groups
depending on the function they perform. Group one (top row) contains buttons related to basic
input/output commands. Group two (middle row) implements the viewing and record/playback
functions. The buttons in group three (bottom row) are used for labelling. To invoke a particular
function, place the mouse pointer onto the corresponding button and click once. All commands
which require further interaction with the user after invocation will display guiding text in the
message area telling the user what he or she is expected to do next. For example, to delete a label,
the user will click on Delete, the message "Please select label to delete" will appear in the message
area and the user will be expected to click in that part of the display section corresponding to the
label to be deleted (not on the label itself).

A marked region is a slice of the waveform currently visible in the window. A region is marked
by clicking on Mark and specifying two boundaries by clicking in the display section. When marked,
a region will be displayed in inverse colours. In the presence of a marked region the commands
Play, Label and Labelas will be applied to the specified region rather than to the whole of the
waveform visible on the screen. Part of the waveform can also be made into a marked region with
the commands Zoom Out and Select. Zoom Out will take the user back to the previous level of
magnification and the waveform being displayed before the execution of the command will become
a marked region. Select will make the part of the waveform corresponding to a particular label
into a marked region. This can be useful for playing back existing labels.

Labelling is carried out with Label and Labelas. Label will assign The Current Label to
a specified slice of the waveform, whilst Labelas will prompt the user to type in the labelling
string. The Current Label is shown in the button in the bottom right corner of the control section.
It defaults to "Speech" and it can be changed by clicking on the button it resides in. Multiple
alternative transcriptions are manipulated using the Set [?] and New buttons. The former is used
to select the desired transcription, the latter is used to create a new alternative transcription.

Fig. 17.1 HSLab display window

17.21.2 Use

HSLab is invoked by typing the command line

HSLab [options] dataFile

where dataFile is a data file in any of the supported formats with a WAVEFORM sample kind. If the
given data file does not exist, then HSLab will assume that a new file is to be recorded with this
name.

The available options for HSLab are 

-a With this switch present, the numeric part of the global labelling string is automatically
incremented after every Label operation.

-i file This option allows transcription files to be output to the named master label file (MLF).

-n Normally HSLab expects to load an existing label file whose name is derived from the speech
data file. This option tells HSLab that a new empty transcription is to be created for the
loaded data file.

-s string This option allows the user to set the string displayed in the "command" button used
to trigger external commands.


-F fmt Set the source data format to fmt. 

-G fmt Set the label file format to fmt. 

-I mlf This loads the master label file mlf . This option may be repeated to load several MLFs. 

-L dir Search directory dir for label files (default is to search current directory). 

-X ext Set label file extension to ext (default is lab). 



HSLab also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
The following is a summary of the function of each HSLab button. 
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Load Load a speech data file and associated transcription. If changes have been made to the 
currently loaded transcription since it was last saved the user will be prompted to save these 
changes before loading the new file. 



Save Save changes made to the transcription into a specified label file.

About Print information about HSLab.

Quit Exit from HSLab. If alterations have been made to the currently loaded transcription since
it was last saved, the user will be prompted to save these changes before exiting.

Command This button is used to trigger an external command which can process the waveform
file currently loaded in HSLab. This is accomplished by setting the environment variable
HSLABCMD to the shell command required to perform the processing. When the Command button
is pressed, any occurrence of $ in the shell command is replaced by the name of the currently
loaded waveform file. Note that only the filename without its extension is substituted. The
string displayed in the "command" button can be changed using the -s option.

Mark Mark a region of the displayed waveform. The user is prompted to specify the start and the
end point of a region with the mouse pointer. The marked region will be displayed in inverse
colours. Only one region can be marked at a time.

Unmark Unmark a previously marked region.

<-- Scroll the display to the left.

--> Scroll the display to the right.

Z.In Zoom into a part of the displayed waveform. If there is a currently marked region then
that region will be zoomed into, otherwise the user will be prompted to select a slice of the
waveform by specifying two points using the mouse pointer.

Z.Out Restore the previous viewing level.

Restore Display the complete waveform in the window. Any marked regions will be unmarked.

Play If there is a marked region of the waveform then that portion of the signal will be played
through the internal speaker. Otherwise, the command will apply to the waveform visible on
the screen.

Rec This initiates recording from the audio input device. The maximum duration of a recording
is limited to 2 mins at 16 kHz sampling rate. Two bar-graphs are displayed: the first (red)
shows the number of samples recorded, the second bar (green) displays the energy of the
incoming signal. Once pressed, the Rec button changes into Stop which, in turn, is used to
terminate the operation. When finished, the audio data stored in the buffer is written out to
disk. Each recording is stored in alternating files dataFile_0 and dataFile_1.

Pause Clicking on this button pauses/un-pauses the recording operation.

Volume This button is used to select the playback volume of the audio device.

x1 This button selects the current level of waveform magnification. The available factors are x1,
x2, x4, x8, x16, and x32.

Label If a marked region exists, then the waveform contained in the region will be labelled with
The Current Label. Otherwise, the command will be applied to the waveform visible on the
screen.

Labelas Same as above, however, the user is prompted to type in the labelling string.

Delete Delete a label.

Edit Edit the string of a label.

Select Select a label as a marked region.

Adjust Adjust the boundaries of a label. To select the label boundary to adjust, click in the
display near to the label boundary.

Set [?] This button is used to select the current alternative transcription displayed and used in
HSLab.

New Creates a new alternative transcription. If an empty alternative transcription already exists,
then a new transcription is not created.

Undo Single level undo operation for labelling commands.

Speech Change the current labelling string (the button in the bottom right of the control area).

The following "mouse" shortcuts are provided. To mark a region, position the pointer at one of
the desired boundaries, then press the left mouse button and, while holding it down, position the
pointer at the other region boundary. Upon releasing the mouse button the marked region will
be highlighted. To play a label, position the mouse cursor anywhere within the corresponding label
"slice" in the label area of the display and click the left mouse button.



17.21.3 Tracing

HSLab does not provide any trace options.

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17.22 HSmooth 

17.22.1 Function 

This program is used to smooth a set of context-dependent tied mixture or discrete HMMs using
deleted interpolation. The program operates as a replacement for the second pass of HERest when
working in parallel mode*. It reads in the N sets of accumulator files containing the statistics
accumulated during the first pass and then interpolates the mixture weights between each context
dependent model and its corresponding context independent version. The interpolation weights are
chosen to maximise the likelihood of each deleted block with respect to the probability distributions
estimated from all other blocks.

17.22.2 Use 

HSmooth is invoked via the command line

HSmooth [options] hmmList accFile ...

where hmmList contains the list of context dependent models to be smoothed and each accFile
is a file of the form HERN.acc dumped by a previous execution of HERest with the -p option
set to N. The HMM definitions are loaded and then for every state and stream of every context
dependent model X, the optimal interpolation weight is found for smoothing between the mixture
weights determined from the X accumulators alone and those determined from the context
independent version of X. The latter is computed simply by summing the accumulators across all context
dependent allophones of X.

The detailed operation of HSmooth is controlled by the following command line options

-b f Set the value of epsilon for convergence in the binary chop optimisation procedure to f.
The binary chop optimisation procedure for each interpolation weight terminates when the
gradient is within epsilon of zero (default 0.001).

-c N Set maximum number of interpolation iterations for the binary chop optimisation procedure
to be N (default 16).

-d dir Normally HSmooth expects to find the HMM definitions in the current directory. This
option tells HSmooth to look in the directory dir to find them.

-m N Set the minimum number of training examples required for any model to N. If the actual
number falls below this value, the HMM is not updated and the original parameters are used
for the new version (default value 1).

-o ext This causes the file name extensions of the original models (if any) to be replaced by ext.

-s file This causes statistics on occupation of each state to be output to the named file.

-u flags By default, HSmooth updates all of the HMM parameters, that is, means, variances
and transition probabilities. This option causes just the parameters indicated by the flags
argument to be updated. This argument is a string containing one or more of the letters m
(mean), v (variance), t (transition) and w (mixture weight). The presence of a letter enables
the updating of the corresponding parameter set.

-v f This sets the minimum variance (i.e. diagonal element of the covariance matrix) to the real
value f (default value 0.0).

-w f Any mixture weight which falls below the global constant MINMIX is treated as being zero.
When this parameter is set, all mixture weights are floored to f * MINMIX.

-x ext By default, HSmooth expects a HMM definition for the model X to be stored in a file
called X. This option causes HSmooth to look for the HMM definition in the file X.ext.

-B Output HMM definition files in binary format. 

-H mmf Load HMM macro model file mmf . This option may be repeated to load multiple MMFs. 



* It is not, of course, necessary to have multiple processors to use this program since each 'parallel' activation can
be executed sequentially on a single processor.


-M dir Store output HMM macro model files in the directory dir. If this option is not given, the 
new HMM definition will overwrite the existing one. 

HSmooth also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
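
The binary chop search controlled by -b and -c can be pictured with the following Python sketch. It is a generic deleted-interpolation illustration under stated assumptions, not the HSmooth implementation: the function name, the two input lists (probabilities assigned to held-out observations by the context dependent and context independent estimates) and the log-likelihood objective are all hypothetical. All probabilities are assumed to be strictly positive.

```python
def interpolation_weight(p_dep, p_indep, eps=0.001, max_iter=16):
    """Find the weight lam in [0, 1] that maximises
    sum(log(lam * a + (1 - lam) * b)) over paired probabilities (a, b),
    by bisecting on the sign of the gradient."""

    def gradient(lam):
        return sum((a - b) / (lam * a + (1.0 - lam) * b)
                   for a, b in zip(p_dep, p_indep))

    lo, hi = 0.0, 1.0
    # The objective is concave, so the gradient decreases with lam.
    if gradient(hi) >= 0.0:   # still rising at lam = 1
        return 1.0
    if gradient(lo) <= 0.0:   # already falling at lam = 0
        return 0.0
    lam = 0.5
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        g = gradient(lam)
        if abs(g) < eps:      # gradient within epsilon of zero
            break
        if g > 0.0:
            lo = lam
        else:
            hi = lam
    return lam
```

The defaults eps=0.001 and max_iter=16 mirror the defaults quoted for -b and -c; sixteen halvings of the unit interval already locate the weight to within about 1.5e-5.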
17.22.3 Tracing 

HSmooth supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting.

00002 show interpolation weights.

00004 give details of optimisation algorithm.

Trace flags are set using the -T option or the TRACE configuration variable.

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17.23 HVite 

17.23.1 Function 

HVite is a general-purpose Viterbi word recogniser. It will match a speech file against a network 
of HMMs and output a transcription for each. When performing N-best recognition a word level 
lattice containing multiple hypotheses can also be produced. 

Either a word level lattice or a label file is read in and then expanded using the supplied
dictionary to create a model based network. This allows arbitrary finite state word networks and
simple forced alignments to be specified.

This expansion can be used to create context independent, word internal context dependent and
cross word context dependent networks. The way in which the expansion is performed is determined
automatically from the dictionary and HMMList. When all labels appearing in the dictionary are
defined in the HMMList no expansion of model names is performed. Otherwise if all the labels
in the dictionary can be satisfied by models dependent only upon word internal context these will
be used else cross word context expansion will be performed. These defaults can be overridden by
HNet configuration parameters.

HVite supports shared parameters and appropriately pre-computes output probabilities. For
increased processing speed, HVite can optionally perform a beam search controlled by a user
specified threshold (see -t option). When fully tied mixture models are used, observation pruning
is also provided (see the -c option). Speaker adaptation is also supported by HVite both in terms
of recognition using an adapted model set or a TMF (see the -k option), and in the estimation of a
transform by unsupervised adaptation using linear transformation in an incremental mode (see the
-j option) or in a batch mode (-K option).
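
The beam search mentioned above deactivates any model whose best token falls too far below the overall best. A minimal Python sketch of that pruning rule follows; the function name and the dictionary representation are hypothetical, and a beam of 0.0 is taken to mean that pruning is disabled, as stated for the -t option.

```python
def active_models(max_log_probs, beam):
    """Return the names of models that survive beam pruning.

    max_log_probs maps a model name to the maximum log probability of
    any token in that model; beam is the pruning threshold f.
    """
    if not max_log_probs:
        return set()
    if beam == 0.0:           # 0.0 disables the beam search mechanism
        return set(max_log_probs)
    best = max(max_log_probs.values())
    # A model is deactivated when it falls more than beam below the best.
    return {name for name, lp in max_log_probs.items() if lp >= best - beam}
```

A narrow beam speeds up decoding but risks pruning the path that would eventually have won, so the threshold is usually tuned on held-out data.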

17.23.2 Use

HVite is invoked via the command line

HVite [options] dictFile hmmList testFiles ...

HVite will then either load a single network file and match this against each of the test files (-w
netFile), or create a new network for each test file either from the corresponding label file (-a) or
from a word lattice (-w). When a new network is created for each test file the path name of the label
(or lattice) file to load is determined from the test file name and the -L and -X options described
below.

If no testFiles are specified the -w s option must be specified and recognition will be performed
from direct audio.

The hmmList should contain a list of the models required to construct the network from the
word level representation.

The recogniser output is written in the form of a label file whose path name is determined from
the test file name and the -l and -y options described below. The list of test files can be stored in
a script file if required.

When performing N-best recognition (see -n N option described below) the output label file
can contain multiple alternatives (-n N M) and a lattice file containing multiple hypotheses can be
produced.

The detailed operation of HVite is controlled by the following command line options



-a Perform alignment. HVite will load a label file and create an alignment network for each
test file.

-b s Use s as the sentence boundary during alignment.

-c f Set the tied-mixture observation pruning threshold to f. When all mixtures of all models are
tied to create a full tied-mixture system, the calculation of output probabilities is treated as a
special case. Only those mixture component probabilities which fall within f of the maximum
mixture component probability are used in calculating the state output probabilities (default
10.0).

-d dir This specifies the directory to search for the HMM definition files corresponding to the
labels used in the recognition network.
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-e When using direct audio input, output transcriptions are not normally saved. When this 
option is set, each output transcription is written to a file called PnS where n is an integer 
which increments with each output file, P and S are strings which are by default empty but 
can be set using the configuration variables RECOUTPREFIX and RECOUTSUFFIX.

-f During recognition keep track of full state alignment. The output label file will contain 
multiple levels. The first level will be the state number and the second will be the word name 
(not the output symbol). 

-g When using direct audio input, this option enables audio replay of each input utterance after
it has been recognised.

-h mask Set the mask for determining which transform names are to be used for the input
transforms.

-i s Output transcriptions to MLF s.

-j i Perform incremental MLLR adaptation every i utterances.

-k Use an input transform (default off).

-l dir This specifies the directory to store the output label files. If this option is not used then
HVite will store the label files in the same directory as the data. When output is directed
to an MLF, this option can be used to add a path to each output file name. In particular,
setting the option -l '*' will cause a label file named xxx to be prefixed by the pattern
"*/xxx" in the output MLF file. This is useful for generating MLFs which are independent
of the location of the corresponding data files.

-m During recognition keep track of model boundaries. The output label file will contain multiple
levels. The first level will be the model number and the second will be the word name (not
the output symbol).

-n i [N] Use i tokens in each state to perform N-best recognition. The number of alternative
output hypotheses N defaults to 1.

-o s Choose how the output labels should be formatted. s is a string with certain letters (from
NSCTWM) indicating binary flags that control formatting options. N normalise acoustic scores
by dividing by the duration (in frames) of the segment. S remove scores from output label.
By default scores will be set to the total likelihood of the segment. C Set the transcription
labels to start and end on frame centres. By default start times are set to the start time
of the frame and end times are set to the end time of the frame. T Do not include times
in output label files. W Do not include words in output label files when performing state or
model alignment. M Do not include model names in output label files when performing state
and model alignment.

-p f Set the word insertion log probability to f (default 0.0).

-q s Choose how the output lattice should be formatted. s is a string with certain letters (from
ABtvaldmn) indicating binary flags that control formatting options. A attach word labels to
arcs rather than nodes. B output lattices in binary for speed. t output node times. v output
pronunciation information. a output acoustic likelihoods. l output language model
likelihoods. d output word alignments (if available). m output within word alignment durations.
n output within word alignment likelihoods.

-r f Set the dictionary pronunciation probability scale factor to f (default value 1.0).

-s f Set the grammar scale factor to f. This factor post-multiplies the language model likelihoods
from the word lattices (default value 1.0).

-t f [i l] Enable beam searching such that any model whose maximum log probability token
falls more than f below the maximum for all models is deactivated. Setting f to 0.0 disables
the beam search mechanism (default value 0.0). In alignment mode two extra parameters
i and l can be specified. If the alignment fails at the initial pruning threshold f, then the
threshold will be increased by i and the alignment will be retried. This procedure is repeated
until the alignment succeeds or the threshold limit l is reached.
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-u i Set the maximum number of active models to i. Setting i to 0 disables this limit (default 0). 

-v f Enable word end pruning. Do not propagate tokens from word end nodes that fall more than
f below the maximum word end likelihood (default 0.0).

-w [s] Perform recognition from word level networks. If s is included then use it to define the
network used for every file.

-x ext This sets the extension to use for HMM definition files to ext.

-y ext This sets the extension for output label files to ext (default rec).

-z ext Enable output of lattices (if performing NBest recognition) with extension ext (default
off).

-L dir This specifies the directory to find input label (when -a is specified) or network files (when
-w is specified).

-X s Set the extension for the label or network files to be s (default value lab).

-E dir [ext] Parent transform directory and optional extension for parent transforms. The
default option is that no parent transform is used.

-G fmt Set the label file format to fmt.

-H mmf Load HMM macro model file mmf. This option may be repeated to load multiple MMFs.

-I mlf This loads the master label file mlf. This option may be repeated to load several MLFs.

-J dir [ext] Add directory to the list of possible input transform directories. Only one of the
options can specify the extension to use for the input transforms.

-K dir [ext] Output transform directory and optional extension for output transforms. The
default option is that there is no output extension and the current transform directory is used.

-P fmt Set the target label format to fmt.

HVite also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4.
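
The retry behaviour of the three-argument form of -t in alignment mode can be sketched as follows. This Python fragment is an illustration only: align_with_retries and the try_align callback are hypothetical names, and treating the limit l as inclusive is an assumption.

```python
def align_with_retries(try_align, f, i, l):
    """Retry an alignment with a progressively wider beam.

    try_align(threshold) is a caller-supplied function that returns the
    alignment, or None when pruning made the alignment fail. The beam
    starts at f and grows by i per attempt until it would exceed l.
    """
    if i <= 0:
        raise ValueError("the increment i must be positive")
    threshold = f
    while threshold <= l:
        result = try_align(threshold)
        if result is not None:
            return threshold, result
        threshold += i        # widen the beam and try again
    return None               # limit reached without a successful alignment
```

Starting narrow keeps most utterances fast while still allowing difficult ones to be aligned at a wider beam.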
17.23.3 Tracing 

HVite supports the following trace options where each trace flag is given using an octal base

0001 enable basic progress reporting.

0002 list observations.

0004 frame-by-frame best token.

0010 show memory usage at start and finish.

0020 show memory usage after each utterance.

Trace flags are set using the -T option or the TRACE configuration variable.
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17.24 LAdapt 

17.24.1 Function 

This program will adapt an existing language model from supplied text data. This is accomplished 
in two stages. First, the text data is scanned and a new language model is generated. In the second 
stage, an existing model is loaded and adapted (merged) with the newly created one according to 
the specified ratio. The target model can be optionally pruned to a specific vocabulary. Note that 
you can only apply this tool to word models or the class n-gram component of a class model - that 
is, you cannot apply full class models. 

17.24.2 Use 

LAdapt is invoked by the command line 

LAdapt [options] -i weight inLMFile outLMFile [textfile ...]

The text data is scanned and a new LM generated. The input language model is then loaded and
the two models merged. The effect of the weight (0.0-1.0) is to control the overall contribution of
each model during the merging process. The output to outLMFile is an n-gram model stored in
the user-specified format.
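
The role of the weight can be pictured as a linear interpolation of two probability estimates. The Python sketch below works on toy unigram tables only; the function name is hypothetical, real back-off n-gram merging is more involved, and assigning the weight to the existing model (with the remainder going to the newly generated one) is an assumption rather than a documented convention.

```python
def interpolate(existing, new, weight):
    """Merge two unigram probability tables by linear interpolation.

    existing and new map words to probabilities. weight is applied to
    the existing model and (1 - weight) to the new one.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in the range 0.0 to 1.0")
    words = set(existing) | set(new)
    return {w: weight * existing.get(w, 0.0) + (1.0 - weight) * new.get(w, 0.0)
            for w in words}
```

If both inputs are properly normalised, the merged table also sums to one, whatever the weight.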

The allowable options to LAdapt are as follows

-a n Allow up to n new words in input texts (default 100000).

-b n Set the n-gram buffer size to n. This controls the size of the buffer used to accumulate
n-gram statistics whilst scanning the input text. Larger buffer sizes will result in more efficient
operation of the tool with fewer sort operations required (default 2000000).

-c n c Set the pruning threshold for n-grams to c. Pruning can be applied to the bigram (n=2)
and longer (n>2) components of the newly generated model. The pruning procedure will keep
only n-grams which have been observed more than c times.

-d s Set the root n-gram data file name to s. By default, n-gram statistics from the text data will
be accumulated and stored as gram.0, gram.1, ..., etc. Note that a larger buffer size will
result in fewer files.

-f s Set the output language model format to s. Possible options are text for the standard
ARPA-MIT LM format, bin for Entropic binary format and ultra for Entropic ultra format.

-g Use existing n-gram data files. If this option is specified the tool will use the existing gram files
rather than scanning the actual texts. This option is useful when adapting multiple language
models from the same text data or when experimenting with different merging weights.

-i w f Interpolate with model f using weight w. Note that at least one model must be specified
with this option.

-j n c Set weighted discounting pruning for n-grams to c. This cannot be applied to unigrams
(n=1).

-n n Produce n-gram language model.

-s s Store s in the header of the gram files.

-t Force Turing-Good discounting if configured otherwise.

-w fn Load word list from fn. The word list will be used to define the target model's vocabulary. If
a word list is not specified, the target model's vocabulary will have all words from the source
model(s) together with any new words encountered in the text data.

-x Create a count-based model.



LAdapt also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
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17.24.3 Tracing 



LAdapt supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting 

00002 monitor buffer saving 
00004 trace word input stream 
00010 trace shift register input 



Trace flags are set using the -T option or the TRACE configuration variable.

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17.25 LBuild 

17.25.1 Function 

This program will read one or more input gram files and generate/update a back-off n-gram language 
model as described in section 14.5. The -n option specifies the order of the final model. Thus, to 
generate a trigram language model, the user may simply invoke the tool with -n 3 which will cause 
it to compute the FoF table and then generate the unigram, bigram and trigram stages of the 
model. Note that intermediate model/FoF files will not be generated. 

As for all tools which process gram files, the input gram files must each be sorted but they need 
not be sequenced. The counts in each input file can be modified by applying a multiplier factor. 
Any n-gram containing an id which is not in the word map is ignored, thus, the supplied word 
map will typically contain just those word and class ids required for the language model under 
construction (see LSubset). 

LBuild supports Turing-Good and absolute discounting as described in section 14.3.1. 

17.25.2 Use 

LBuild is invoked by typing the command line 

LBuild [options] wordmap outfile [mult] gramfile .. [mult] gramfile .. 

The given word map file is loaded and then the set of named gram files are merged to form 
a single sorted stream of n-grams. Any n-grams containing ids not in the word map are ignored. 
The list of input gram files can be interspersed with multipliers. These are floating-point format 
numbers which must begin with a plus or minus character (e.g. +1.0, -0.5, etc.). The effect of a 
multiplier x is to scale the n-gram counts in the following gram files by the factor x. A multiplier 
stays in effect until it is redefined. The output to outfile is a back-off n-gram language model file 
in the specified file format. 
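
The way multipliers interact with the merged stream can be illustrated with a small Python sketch. It is not the HTK implementation: gram files are modelled as in-memory sorted lists of (n-gram, count) pairs, the file names are made up, counts for identical n-grams are assumed to be summed, and a multiplier of 1.0 is assumed to apply until the first explicit multiplier is seen.

```python
import heapq
from itertools import groupby


def pair_multipliers(args, default=1.0):
    """Attach the multiplier currently in effect to each gram file name."""
    mult, pairs = default, []
    for arg in args:
        if arg[0] in "+-":          # multipliers must begin with '+' or '-'
            mult = float(arg)       # stays in effect until redefined
        else:
            pairs.append((mult, arg))
    return pairs


def merge_scaled(streams):
    """Merge sorted (ngram, count) lists into one sorted stream of scaled counts."""
    scaled = [[(ngram, count * mult) for ngram, count in grams]
              for mult, grams in streams]
    merged = heapq.merge(*scaled, key=lambda item: item[0])
    for ngram, group in groupby(merged, key=lambda item: item[0]):
        yield ngram, sum(count for _, count in group)


# Hypothetical gram files; n-grams are tuples of word ids.
files = {
    "main.0": [((1, 2), 4), ((1, 3), 2)],
    "edit.0": [((1, 2), 2), ((2, 2), 6)],
}
pairs = pair_multipliers(["main.0", "-0.5", "edit.0"])
result = list(merge_scaled([(mult, files[name]) for mult, name in pairs]))
print(result)
```

In this example the bigram (1, 2) receives 4 from the first file and 2 scaled by -0.5 from the second, giving a merged count of 3.0.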

See the LPCalc options in section 18.1 for details on changing the discounting type from the 
default of Turing-Good, as well as other configuration file options. 

The allowable options to LBuild are as follows 

-c n c Set cutoff for n-gram to c. 

-d n c Set weighted discount pruning for n-gram to c for Seymore-Rosenfeld pruning. 
-f t Set output model format to t (TEXT, BIN, ULTRA). 
-k n Set discounting range for Good-Turing discounting to [1..n]. 
-l f Build model by updating existing LM in f. 
-n n Set final model order to n. 

-t f Load the FoF file f. This is only used for Turing-Good discounting and is not essential. 
-u c Set the minimum occurrence count for unigrams to c. (Default is 1) 

-x Produce a counts model. 
LBuild also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 

17.25.3 Tracing 

LBuild supports the following trace options where each trace flag is given using an octal base 
00001 basic progress reporting. 



Trace flags are set using the -T option or the TRACE configuration variable. 
17.26 LFoF 

17.26.1 Function 

This program will read one or more input gram files and generate a frequency-of-frequency or FoF 
file. A FoF file is a list giving the number of distinct n-grams that occur just once, the number 
that occur just twice, etc. The format of a FoF file is described in section 16.6. 
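
As an illustration of what a frequency-of-frequency table contains, the following Python sketch computes one from a small set of bigram counts. It shows the idea only; the function name and data are invented for the example and the output is not in the FoF file format of section 16.6.

```python
from collections import Counter


def frequency_of_frequencies(counts, max_entries):
    """Entry r-1 of the result is the number of distinct n-grams seen exactly r times."""
    fof = Counter(counts)
    return [fof.get(r, 0) for r in range(1, max_entries + 1)]


# Hypothetical bigram counts.
bigram_counts = {
    ("a", "b"): 1,
    ("b", "c"): 1,
    ("c", "d"): 2,
    ("d", "e"): 5,
}
print(frequency_of_frequencies(bigram_counts.values(), 4))  # [2, 1, 0, 0]
```

Two bigrams occur once and one occurs twice; the bigram seen five times lies beyond the four entries requested and so does not appear in the table.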

As for all tools which process gram files, the input gram files must each be sorted but they need 
not be sequenced. The counts in each input file can be modified by applying a multiplier factor. 
Any n-gram containing an id which is not in the word map is ignored, thus, the supplied word 
map will typically contain just those word and class ids required for the language model under 
construction (see LSubset). 

LFoF also provides an option to generate an estimate of the number of n-grams which would 
be included in the final language model for each possible cutoff by setting LPCALC: TRACE = 2. 

17.26.2 Use 

LFoF is invoked by typing the command line 

LFoF [options] wordmap foffile [mult] gramfile .. [mult] gramfile .. 

The given word map file is loaded and then the set of named gram files are merged to form a single 
sorted stream of n-grams. Any n-grams containing ids not in the word map are ignored. The list 
of input gram files can be interspersed with multipliers. These are floating-point format numbers 
which must begin with a plus or minus character (e.g. +1.0, -0.5, etc.). The effect of a multiplier 
x is to scale the n-gram counts in the following gram files by the factor x. A multiplier stays in 
effect until it is redefined. The output to foffile is a FoF file as described in section 16.6. 
The allowable options to LFoF are as follows 

-f N Set the number of FoF entries to N (default 100). 
-n N Set n-gram size to N (defaults to max). 

LFoF also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 

17.26.3 Tracing 



LFoF supports the following trace options where each trace flag is given using an octal base 
00001 basic progress reporting 
Trace flags are set using the -T option or the TRACE configuration variable. 
17.27 LGCopy 
17.27.1 Function 

This program will copy one or more input gram files to a set of one or more output gram files. 
The input gram files must each be sorted but they need not be sequenced. Unless word-to-class 
mapping is being performed, the output files will, however, be sequenced. Hence, given a collection 
of unsequenced gram files, LGCopy can be used to generate an equivalent sequenced set. This is 
useful for reducing the number of parallel input streams that tools such as LBuild must maintain, 
thereby improving efficiency. 

As for all tools which can input gram files, the counts in each input file can be modified by 
applying a multiplier factor. Note, however, that since the counts within gram files are stored as 
integers, use of non-integer multiplier factors will lead to the counts being rounded in the output 
gram files. 

In addition to manipulating the counts, the -n option also allows the input grams to be truncated 
by summing the counts of all equivalenced grams. For example, if the 3-grams a x y 5 and b x y 
3 were truncated to 2-grams, then x y 8 would be output. Truncation is performed before any of 
the mapping operations described below. 
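
The truncation example can be reproduced with a few lines of Python. This is a sketch of the described behaviour rather than LGCopy's own code; it assumes, as the example above implies, that truncation keeps the final n words of each gram.

```python
from collections import defaultdict


def truncate(grams, n):
    """Keep the last n words of each gram and sum counts of grams that become identical."""
    totals = defaultdict(int)
    for words, count in grams:
        totals[tuple(words[-n:])] += count
    return sorted(totals.items())


trigrams = [(("a", "x", "y"), 5), (("b", "x", "y"), 3)]
print(truncate(trigrams, 2))  # [(('x', 'y'), 8)]
```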

LGCopy also provides options to map gram words to classes using a class map file and filter 
the resulting output. The most common use of this facility is to map out-of-vocabulary (OOV) 
words into the unknown symbol in preparation for building a conventional word n-gram language 
model for a specific vocabulary. However, it can also be used to prepare for building a class-based 
n-gram language model. 

Word-to-class mapping is enabled by specifying the class map file with the -w option. Each 
n-gram word is then replaced by its class symbol as defined by the class map. If the -o option is 
also specified, only n-grams containing class symbols are stored in the internal buffer. 
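
A minimal Python sketch of the mapping step is given below. The class map is modelled as a plain dictionary from word to class symbol, which is an assumption made for the example and not the class map file format; summing the counts of n-grams that coincide after mapping is likewise assumed. The symbol !!UNK is used as the unknown class.

```python
def map_to_classes(grams, class_of, classes_only=False):
    """Replace each word by its class symbol where the class map defines one.

    class_of      dict from word to class symbol (stand-in for a class map file)
    classes_only  if True, keep only n-grams containing a class symbol (cf. -o)
    """
    symbols = set(class_of.values())
    out = {}
    for words, count in grams:
        mapped = tuple(class_of.get(word, word) for word in words)
        if classes_only and not any(word in symbols for word in mapped):
            continue
        out[mapped] = out.get(mapped, 0) + count
    return sorted(out.items())


# Hypothetical data: two rare words are mapped to the unknown symbol.
class_of = {"quokka": "!!UNK", "zyzzyva": "!!UNK"}
grams = [(("the", "cat"), 4), (("the", "quokka"), 1), (("the", "zyzzyva"), 2)]

print(map_to_classes(grams, class_of))
print(map_to_classes(grams, class_of, classes_only=True))
```

With the filter enabled only the bigram ending in !!UNK survives, with the counts of the two original bigrams combined.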

17.27.2 Use 

LGCopy is invoked by typing the command line 

LGCopy [options] wordmap [mult] gramfile .... [mult] gramfile ... 

The given word map file is loaded and then the set of named gram files are input in parallel to 
form a single sorted stream of n-grams. Counts for identical n-grams in multiple source files are 
summed. The merged stream is written to a sequence of output gram files named data.0, data.1, 
etc. The list of input gram files can be interspersed with multipliers. These are floating-point 
format numbers which must begin with a plus or minus character (e.g. +1.0, -0.5, etc.). The 
effect of a multiplier x is to scale the n-gram counts in the following gram files by the factor x. The 
resulting scaled counts are rounded to the nearest integer on output. A multiplier stays in effect 
until it is redefined. The scaled input grams can be truncated, mapped and filtered before being 
output as described above. 

The allowable options to LGCopy are as follows 

-a n Set the maximum number of new classes that can be added to the word map (default 1000, 
only used in conjunction with class maps). 

-b n Set the internal gram buffer size to n (default 2000000). LGCopy stores incoming n-grams 
in this buffer. When the buffer is full, the contents are sorted and written to an output gram 
file. Thus, the buffer size determines the amount of process memory that LGCopy will use 
and the size of the individual output gram files. 

-d Directory in which to store the output gram files (default current directory). 

-i n Set the index of the first gram file output to be n (default 0). 

-m s Save class-resolved word map to s. 

-n n Normally, n-gram size is preserved from input to output. This option allows the output 
n-gram size to be truncated to n where n must be less than the input n-gram size. 

-o n Output class mappings only. Normally all input n-grams are copied to the output, however, 
if a class map is specified, this option forces the tool to output only n-grams containing at 
least one class symbol. 

-r s Set the root name of the output gram files to s (default "data"). 
-w fn Load class map from fn. 

LGCopy also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
17.27.3 Tracing 

LGCopy supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting. 

00002 monitor buffer save operations. 

Trace flags are set using the -T option or the TRACE configuration variable. 
17.28 LGList 



17.28.1 Function 



This program will list the contents of one or more HLM gram files. In addition to printing the whole 
file, an option is provided to print just those n-grams containing certain specified words and/or ids. 
It is mainly used for debugging. 



17.28.2 Use 



LGList is invoked by typing the command line 

LGList [options] mapfile gramfile .... 

The specified gram files are printed to the output. The n-grams are printed one per line following a 
summary of the header information. Each n-gram is printed in the form of a list of words followed 
by the count. 

Normally all n-grams are printed. However, if either of the options -i or -f are used to add 
words to a filter list, then only those n-grams which include a word in the filter list are printed. 
The allowable options to LGList are as follows 




-f w Add word w to the filter list. This option can be repeated, it can also be mixed with uses of 
the -i option. 

-i n Add word with id n to the filter list. This option can be repeated, it can also be mixed with 
uses of the -f option. 

LGList also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 

17.28.3 Tracing 



LGList supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting. 

Trace flags are set using the -T option or the TRACE configuration variable. 
17.29 LGPrep 
17.29.1 Function 

The function of this tool is to scan a language model training text and generate a set of gram files 
holding the n-grams seen in the text along with their counts. By default, the output gram files are 
named gram.0, gram.1, gram.2, etc. However, the root name can be changed using the -r option 
and the start index can be set using the -i option. 

Each output gram file is sorted but the files themselves will not be sequenced (see section 16.5). 
Thus, when using LGPrep with substantial training texts, it is good practice to subsequently copy 
the complete set of output gram files using LGCopy to reorder them into sequence. This process 
will also remove duplicate occurrences making the resultant files more compact and faster to read 
by the HLM processing tools. 

Since LGPrep will often encounter new words in its input, it is necessary to update the word 
map. The normal operation therefore is that LGPrep begins by reading in a word map containing 
all the word ids required to decode all previously generated gram files. This word map is then 
updated to include all the new words seen in the current input text. On completion, the updated 
word map is output to a file of the same name as the input word map in the directory used to 
store the new gram files. Alternatively, it can be output to a specified file using the -w option. The 
sequence number in the header of the newly created word map will be one greater than that of the 
original. 

LGPrep can also apply a set of "match and replace" edit rules to the input text stream. The 
purpose of this facility is not to replace input text conditioning filters but to make simple changes 
to the text after the main gram files have been generated. The editing works by passing the text 
through a window one word at a time. The edit rules consist of a pattern and a replacement text. 
At each step, the pattern of each rule is matched against the window and if a match occurs, then the 
matched word sequence is replaced by the string in the replacement part of the rule. Two sets of gram 
files are generated by this process. A "negative" set of gram files contain n-grams corresponding 
to just the text strings which were modified and a "positive" set of gram files contain n-grams 
corresponding to the modified text. All text for which no rules matched is ignored and generates 
no gram file output. Once the positive and negative gram files have been generated, the positive 
grams are added (i.e. input with a weight of +1) to the original set of gram files and the negative 
grams are subtracted (i.e. input with a weight of -1). The net result is that the tool reading the 
full set of gram files receives a stream of n-grams which will be identical to the stream that it would 
have received if the editing commands had been applied to the text source when the original main 
gram file set had been generated. 

The edit rules are stored in a file and read in using the -f option. They consist of set definitions 
and rule definitions, each written on a separate line. Each set defines a set of words and is identified 
by an integer in the range 0 to 255 

<set-def> = '#' <number> <word1> <word2> ... <wordN> 

For example, 

#4 red green blue 

defines set number 4 as being the 3 words "red", "green" and "blue". Rules consist of an application 
factor, a pattern and a replacement 

<rule-def> = <app-factor> <pattern> : <replacement> 
<pattern> = { <word> | '*' | !<set> | %<set> } 
<replacement> = { '$'<field> | string } 

The application factor should be a real number in the range 0 to 1 and it specifies the proportion of 
occurrences of the pattern which should be replaced. The pattern consists of a sequence of words, 
wildcard symbols ("*") which match any word, and set references of the form %n denoting any word 
which is in set number n and !n denoting any word which is not in set number n. The replacement 
consists of a sequence of words and field references of the form $i which denotes the i'th matching 
word in the input. 

As an example, the following rules would translate 50% of the occurrences of numbers in the 
form "one hundred fifty" to "one hundred and fifty" and 30% of the occurrences of "one hundred" 
to "a hundred" . 
#0 one two three four five six seven eight nine fifty sixty seventy 
#1 hundred 

0.5 * * hundred %0 * * : $0 $1 $2 and $3 $4 $5 
0.3 * * !0 one %1 * * : $0 $1 $2 a $4 $5 $6 

Note finally, that LGPrep processes edited text in a parallel stream to normal text, so it is possible 
to generate edited gram files whilst generating the main gram file set. However, normally the main 
gram files already exist and so it is normal to suppress gram file generation using the -z option 
when using edit rules. 
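
To make the rule syntax concrete, the following Python sketch parses set and rule definitions and applies one rule to a single window of words. It is an illustration written for this section, not LGPrep's implementation: the function names are invented, the window is assumed to have the same length as the pattern, field references are taken to be numbered from 0 as in the example rules, and the application factor is parsed but not used to select a proportion of matches.

```python
def parse_rules(lines):
    """Parse set definitions ('#n w1 w2 ...') and rule definitions."""
    sets, rules = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            number, *words = line[1:].split()
            sets[int(number)] = set(words)
        else:
            left, right = line.split(":", 1)
            factor, *pattern = left.split()
            rules.append((float(factor), pattern, right.split()))
    return sets, rules


def token_matches(token, word, sets):
    """Match one pattern token against one word of the window."""
    if token == "*":
        return True                                  # wildcard
    if token.startswith("%"):
        return word in sets[int(token[1:])]          # word is in set n
    if token.startswith("!"):
        return word not in sets[int(token[1:])]      # word is not in set n
    return token == word


def rewrite(window, pattern, replacement, sets):
    """Return the replacement words if the pattern matches the window, else None."""
    if len(window) != len(pattern):
        return None
    if not all(token_matches(t, w, sets) for t, w in zip(pattern, window)):
        return None
    return [window[int(r[1:])] if r.startswith("$") else r for r in replacement]


RULES = """
#0 one two three four five six seven eight nine fifty sixty seventy
#1 hundred
0.5 * * hundred %0 * * : $0 $1 $2 and $3 $4 $5
0.3 * * !0 one %1 * * : $0 $1 $2 a $4 $5 $6
""".splitlines()

sets, rules = parse_rules(RULES)
_, pattern, replacement = rules[0]
window = "costs one hundred fifty dollars today".split()
print(" ".join(rewrite(window, pattern, replacement, sets)))
```

For the window shown, the first rule inserts "and" after "hundred", giving "costs one hundred and fifty dollars today".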

17.29.2 Use 

LGPrep is invoked by typing the command line 

LGPrep [options] wordmap [textfile ...] 

Each text file is processed in turn and treated as a continuous stream of words. If no text files are 
specified standard input is used and this is the more usual case since it allows the input text source 
to be filtered before input to LGPrep, for example, using LCond.pl (in LMTutorial/extras/). 

Each n-gram in the input stream is stored in a buffer. When the buffer is full it is sorted and 
multiple occurrences of the same n-gram are merged and the count set accordingly. When this 
process ceases to yield sufficient buffer space, the contents are written to an output gram file. 
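
The buffering scheme explains why the output files are individually sorted yet not sequenced. The Python sketch below is a deliberate simplification and not the LGPrep algorithm: it merges duplicates as they arrive and writes out a "file" (here just a list) whenever the number of distinct n-grams reaches a made-up buffer_size.

```python
from collections import Counter


def count_ngrams(words, n, buffer_size):
    """Yield one sorted (ngram, count) list per simulated output gram file."""
    buffer = Counter()
    for i in range(len(words) - n + 1):
        buffer[tuple(words[i:i + n])] += 1
        if len(buffer) >= buffer_size:      # buffer full: sort and write out
            yield sorted(buffer.items())
            buffer = Counter()
    if buffer:                              # flush whatever remains at the end
        yield sorted(buffer.items())


words = "a b a b a c".split()
small = list(count_ngrams(words, 2, buffer_size=2))
large = list(count_ngrams(words, 2, buffer_size=10))
print(len(small), len(large))
```

With the small buffer the bigram (a, b) ends up in two different output files, which is the kind of duplication that a subsequent copy with LGCopy removes; with the large buffer a single file holds each bigram once.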

The word map file defines the mapping of source words to the numeric ids used within HLM 
tools. Any words not in the map are allocated new ids and added to the map. On completion, a 
new map with the same name (unless specified otherwise with the -w option) is output to the same 
directory as the output gram files. To initialise the first invocation of this updating process, a word 
map file should be created with a text editor containing the following: 

Name=xxxx 
SeqNo=0 
Language=yyyy 
Entries=0 
Fields=ID 
\Words\ 

where xxxx is an arbitrarily chosen name for the word map and yyyy is the language. Fields 
specifying the escaping mode to use (HTK or RAW) and changing Fields to include frequency counts 
in the output (i.e. FIELDS = ID,WFC) can also be given. Alternatively, they can be added to the 
output using command line options. 
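
Instead of typing the initial word map by hand, the same header can be produced by a short script. The Python sketch below writes only the header lines shown above; the function name and the example values "demo" and "English" are placeholders chosen for illustration.

```python
def empty_word_map(name, language, fields="ID"):
    """Return the text of an empty word map with the header fields shown above."""
    lines = [
        f"Name={name}",
        "SeqNo=0",
        f"Language={language}",
        "Entries=0",
        f"Fields={fields}",
        "\\Words\\",
    ]
    return "\n".join(lines) + "\n"


text = empty_word_map("demo", "English")
print(text, end="")
```

The returned string can then be written to a file of the user's choice and given to LGPrep as its word map argument.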
The allowable options to LGPrep are as follows 

-a n Allow up to n new words in input texts (default 100000). 

-b n Set the internal gram buffer size to n (default 2000000). LGPrep stores incoming n-grams 
in this buffer. When the buffer is full, the contents are sorted and written to an output gram 
file. Thus, the buffer size determines the amount of process memory that LGPrep will use 
and the size of the individual output gram files. 

-c Add word counts to the output word map. This overrides the setting in the input word map 
(default off). 

-d Directory in which to store the output gram files (default current directory). 
-e n Set the internal edited gram buffer size to n (default 100000). 
-f s Fix (i.e. edit) the text source using the rules in s. 

-h Do not use HTK escaping in the output word map (default on). 
-i n Set the index of the first gram file output to be n (default 0). 
-n n Set the output n-gram size to n (default 3). 

-q Tag words at sentence start with underscore (_). 




-r s Set the root name of the output gram files to s (default "gram"). 

-s s Write the string s into the source field of the output gram files. This string should be a 
comment describing the text source. 

-w s Write the output map file to s (default same as input map name stored in the output gram 
directory). 

-z Suppress gram file output. This option allows LGPrep to be used just to compute a word 
frequency map. It is also normally applied when applying edit rules to the input. 

-Q Print a summary of all commands supported by this tool. 
LGPrep also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 

17.29.3 Tracing 

LGPrep supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting. 

00002 monitor buffer save operations. 
00004 Trace word input stream. 
00010 Trace shift register input. 
00020 Rule input monitoring. 

00040 Print rule set. 

Trace flags are set using the -T option or the TRACE configuration variable. 
17.30 LLink 

17.30.1 Function 

This tool will create the link file necessary to use the word-given-class and class-given-class 
components of a class n-gram language model. 

Having created the class n-gram component with LBuild and the word-given-class component 
with Cluster, you can then create a third file which points to these two other files by using the 
LLink tool. This file is the language model you pass to utilities such as LPlex. Alternatively 
if run with its -s option then LLink will link the two components together and create a single 
resulting file. 

17.30.2 Use 

LLink is invoked by the command line 

LLink [options] word-classLMfile class-classLMfile outLMfile 

The tool checks for the existence of the two existing component language model files, with 
word-classLMfile being the word-given-class file from Cluster and class-classLMfile being the 
class n-gram model generated by LBuild. The word-given-class file is read to discover whether it 
is a count or probability-based file, and then an appropriate link file is written to outLMfile. This 
link file is then suitable for passing to LPlex. Optionally you may overrule the count/probability 
distinction by using the -c and -p parameters. Passing the -s parameter joins the two files into 
one single resulting language model rather than creating a third link file which points to the other 
two. 

The allowable options to LLink are as follows 

-c Force the link file to describe the word-given-class component as a 'counts' file. 

-p Force the link file to describe the word-given-class component as a 'probabilities' file. 

-s Write a single file containing both the word-class component and the class-class component. 
This single resulting file is then a self-contained language model requiring no other files. 

LLink also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
17.30.3 Tracing 

LLink supports the following trace options where each trace flag is given using an octal base 
00001 basic progress reporting 

Trace flags are set using the -T option or the TRACE configuration variable. 

17.31 LMerge 

17.31.1 Function 

This program combines one or more language models to produce an output model for a specified 
vocabulary. You can only apply it to word models or the class n-gram component of a class model 
- that is, you cannot apply it to full class models. 

17.31.2 Use 

LMerge is invoked by typing the command line 

LMerge [options] wordList inModel outModel 

The word map and class map are loaded, word-class mappings performed and a new map is saved 
to outMapFile. The output map's name will be set to 

Name = inMapName%%classMapName 

The allowable options to LMerge are as follows 
-f s Set the output LM file format to s. Available options are text, bin or ultra (default bin). 
-i f fn Interpolate with model fn using weight f. 
-n n Produce an n-gram model. 

LMerge also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
17.31.3 Tracing 

LMerge does not provide any trace options. However, trace information is available from the 
underlying library modules LWMap and LCMap by setting the appropriate trace configuration 
parameters. 
17.32 LNewMap 

17.32.1 Function 

This tool will create an empty word map suitable for use with LGPrep. 

17.32.2 Use 

LNewMap is invoked by the command line 
LNewMap [options] name mapfn 

A new word map is created with the file name 'mapfn', with its constituent Name header set to 
the text passed in 'name'. It also creates default SeqNo, Entries, EscMode and Fields headers in 
the file. The contents of the EscMode header may be altered from the default of RAW using the -e 
option, whilst the Fields header contains ID but may be added to using the -f option. 
The allowable options to LNewMap are therefore 

-e esc Change the contents of EscMode header to esc. Default is RAW. 
-f fld Add the field fld to the Fields header. 

LNewMap also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 
17.32.3 Tracing 

LNewMap supports the following trace options where each trace flag is given using an octal base 
00001 basic progress reporting 

Trace flags are set using the -T option or the TRACE configuration variable. 
17.33 LNorm 

17.33.1 Function 

The basic function of this tool is to renormalise language models, optionally pruning the vocabulary 
at the same time or applying cutoffs or weighted discounts. 

17.33.2 Use 

LNorm is invoked by the command line 
LNorm [options] inLMFile outLMFile 

This reads in the language model inLMFile and writes a new language model to outLMFile, 
applying editing operations controlled by the following options. In many respects it is similar to 
HLMCopy, but unlike HLMCopy it will always renormalise the resulting model. 

-c n c Set the pruning threshold for n-grams to c. Pruning can be applied to the bigram and 
higher components of a model (n>1). The pruning procedure will keep only n-grams which 
have been observed more than c times. Note that this option is only applicable to count-based 
language models. 

-d f Set weighted discount pruning for n-gram to c for Seymore-Rosenfeld pruning. Note that this 
option is only applicable to count-based language models. 

-f s Set the output language model format to s. Possible options are TEXT for the standard 
ARPA-MIT LM format, BIN for Entropic binary format and ULTRA for Entropic ultra format. 

-n n Save target model as n-gram. 

-w f Read a word-list defining the output vocabulary from f. This will be used to select the 
vocabulary for the output language model. 

LNorm also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 

17.33.3 Tracing 

LNorm supports the following trace options where each trace flag is given using an octal base 
00001 basic progress reporting. 

Trace flags are set using the -T option or the TRACE configuration variable. 
17.34 LPlex 

17.34.1 Function 

This program computes the perplexity and out of vocabulary (OOV) statistics of text data using one 
or more language models. The perplexity is calculated on a per-utterance basis. Each utterance in 
the text data should start with a sentence start symbol (<s>) and finish with a sentence end (</s>) 
symbol. The default values for the sentence markers can be changed via the config parameters 
STARTWORD and ENDWORD respectively. Text data can be supplied as an HTK Master Label File 
(MLF) or as plain text (-t option). Multiple perplexity tests can be performed on the same texts 
using separate n-gram components of the model(s). OOV words in the test data can be handled in 
two ways. By default the probability of n-grams containing words not in the lexicon is simply not 
calculated. This is useful for testing closed vocabulary models on texts containing OOVs. If the -u 
option is specified, n-grams giving the probability of an OOV word conditioned on its predecessors 
are discarded, however, the probability of words in the lexicon can be conditioned on context 
including OOV words. The latter mode of operation relies on the presence of the unknown class 
symbol (!!UNK) in the language model (the default value can be changed via the config parameter 
UNKNOWNNAME). If multiple models are specified (-i option) the probability of an n-gram will be 
calculated as a sum of the weighted probabilities from each of the models. 
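
The two OOV conventions and the interpolation rule can be sketched in Python as follows. This is not the LPlex implementation: language models are represented by hypothetical functions prob(word, context), only predictions with a full context are scored, and perplexity is taken as the exponential of the negative mean log probability of the scored words. How LPlex itself treats the sentence markers when normalising is not modelled here.

```python
import math


def interpolate(models, weights, word, context):
    """Weighted sum of the probabilities from each model (cf. the -i option)."""
    return sum(w * m(word, context) for m, w in zip(models, weights))


def perplexity(words, prob, vocab, n=2, allow_oov_context=False):
    """Perplexity of one utterance under the OOV conventions described above.

    prob(word, context) returns a probability; context is a tuple of n-1 words.
    allow_oov_context=True mimics the behaviour described for the -u option.
    """
    log_sum, used = 0.0, 0
    for i in range(n - 1, len(words)):
        word, context = words[i], tuple(words[i - n + 1:i])
        if word not in vocab:
            continue                    # probability of an OOV word is never scored
        if any(c not in vocab for c in context):
            if not allow_oov_context:
                continue                # default: skip n-grams containing an OOV
            context = tuple(c if c in vocab else "!!UNK" for c in context)
        log_sum += math.log(prob(word, context))
        used += 1
    return math.exp(-log_sum / used) if used else float("nan")


# Toy model: probability 0.5 after the unknown symbol, 0.25 otherwise.
def toy(word, context):
    return 0.5 if "!!UNK" in context else 0.25


vocab = {"<s>", "a", "b", "</s>"}
utterance = ["<s>", "a", "zzz", "b", "</s>"]
closed = perplexity(utterance, toy, vocab)
with_context = perplexity(utterance, toy, vocab, allow_oov_context=True)
print(closed, with_context)
```

In the default mode two bigrams are scored, since both bigrams involving the OOV word "zzz" are skipped. With OOV contexts allowed, the prediction of "b" given the unknown symbol is scored as well, while the prediction of "zzz" itself is still ignored.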


17.34.2 Use 

LPlex is invoked by the command line 
LPlex [options] langmodel labelFiles ... 

The allowable options to LPlex are as follows 




-c n c Set the pruning threshold for n-grams to c. Pruning can be applied to the bigram (n=2) 
and trigram (n=3) components of the model. The pruning procedure will keep only n-grams 
which have been observed more than c times. Note that this option is only applicable to the 
model generated from the text data. 

-e s t Label t is made equivalent to label s. More precisely t is assigned to an equivalence class 
of which s is the identifying member. The equivalence mappings are applied to the text and 
should be used to map symbols in the text to symbols in the language model's vocabulary. 

-i w fn Interpolate with model fn using weight w. 

-n n Perform a perplexity test using the n-gram component of the model. Multiple tests can be 
specified. By default the tool will use the maximum value of n available. 

-o Print a sorted list of unique OOV words encountered in the text and their occurrence counts. 

-t Text stream mode. If this option is set, the specified test files will be assumed to contain 
plain text. 

-u In this mode OOV words can be present in the n-gram context when predicting words in the 
vocabulary. The conditional probability of OOV words is still ignored. 

-w fn Load word list in fn. The word list will be used as the restricting vocabulary for the 
perplexity calculation. If a word list file is not specified, the target vocabulary will be constructed 
by combining the vocabularies of all specified language models. 

-z s Redefine the null equivalence class name to s. The default null class name is ???. Any words 
mapped to the null class will be deleted from the text. 



LPlex also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 



17.34.3 Tracing 

LPlex supports the following trace options where each trace flag is given using an octal base 

00001 basic progress reporting. 

00002 print information after each utterance processed. 
00004 display encountered OOVs. 
00010 display probability of each n-gram looked up. 
00020 print each utterance and its perplexity. 

Trace flags are set using the -T option or the TRACE configuration variable. 
17.35 LSubset 

17.35.1 Function 

This program will resolve a word map against a class map and produce a new word map which 
contains the class-mapped words. The tool is typically used to generate a vocabulary-specific 
n-gram word map which is then supplied to LBuild to build the actual language models. 

All class symbols present in the class map will be added to the output map. The -a option 
can be used to set the maximum number of new class symbols in the final word map. Note that 
the word-class map resolution procedure is identical to the one used in LSubset when filtering 
n-gram files. 

17.35.2 Use 

LSubset is invoked by typing the command line 

LSubset [options] inMapFile classMap outMapFile 

The word map and class map are loaded, word-class mappings performed and a new map is saved 
to outMapFile. The output map's name will be set to 

Name = inMapName%%classMapName 

The allowable options to LSubset are as follows 

-a n Set the maximum number of new classes that can be added to the output map (default 1000). 

LSubset also supports the standard options -A, -C, -D, -S, -T, and -V as described in section 4.4. 

17.35.3 Tracing 

LSubset does not provide any trace options. However, trace information is available from the 
underlying library modules LWMap and LCMap by setting the appropriate trace configuration 
parameters. 






Chapter 18 

Configuration Variables 



This chapter tabulates all configuration variables used in HTK. 

18.1 Configuration Variables used in Library Modules 

Table 18.1: Lrmary Module Configuration Variables 



Module | Name | Default | Description
 | TRACE | 0 | Trace setting
HParm, HWave | SOURCEFORMAT | HTK | File format of source
 | TARGETFORMAT | | File format of target
HLabel, HAudio, HWave, HParm | SOURCERATE | 0.0 | Sample rate of source in 100ns units
HParm, HWave | TARGETRATE | 0.0 | Sample rate of target in 100ns units
HAudio | LINEOUT | T | Enable audio output to machine line output
 | PHONESOUT | T | Enable audio output to machine phones output
 | SPEAKEROUT | F | Enable audio output to machine internal speaker
 | LINEIN | T | Enable audio input from machine line input
 | MICIN | F | Enable audio input from machine mic input
HWave | NSAMPLES | | Num samples in alien file input via a pipe
 | HEADERSIZE | | Size of header in an alien file
 | BYTEORDER | | Define byte order VAX or other
 | STEREOMODE | | Select channel: RIGHT or LEFT
HParm | SOURCEKIND | ANON | Parameter kind of source
 | TARGETKIND | ANON | Parameter kind of target
 | MATTRANFN | | Input transformation file
 | SAVECOMPRESSED | F | Save the output file in compressed form
 | SAVEWITHCRC | T | Attach a checksum to output parameter file
 | ADDDITHER | 0.0 | Level of noise added to input signal
 | ZMEANSOURCE | F | Zero mean source waveform before analysis
 | WINDOWSIZE | 256000.0 | Analysis window size in 100ns units
 | USEHAMMING | T | Use a Hamming window
 | DOUBLEFFT | F | Use twice the required size for FFT
 | PREEMCOEF | 0.97 | Set pre-emphasis coefficient
 | LPCORDER | 12 | Order of lpc analysis
 | NUMCHANS | 20 | Number of filterbank channels
 | LOFREQ | -1.0 | Low frequency cut-off in fbank analysis

 | HIFREQ | -1.0 | High frequency cut-off in fbank analysis
 | WARPFREQ | 1.0 | Frequency warping factor
 | WARPLCUTOFF | | Lower frequency threshold for non-linear warping
 | CMEANDIR | | Directory to find cepstral mean vectors
 | CMEANMASK | | Filename mask for cepstral mean vectors
 | CMEANPATHMASK | | Path name mask for cepstral mean vectors, the matched string is used to extend CMEANDIR string
 | VARSCALEDIR | | Directory to find cepstral variance vectors
 | VARSCALEMASK | | Filename mask for cepstral variance vectors
 | VARSCALEPATHMASK | | Path name mask for cepstral variance vectors, the matched string is used to extend VARSCALEDIR string
 | VARSCALEFN | | Filename of global variance scaling vector
 | COMPRESSFACT | 0.33 | Amplitude compression factor for PLP
HLabel, HParm | V1COMPAT | F | HTK V1 compatibility setting
HWave, HShell | NATURALREADORDER | F | Enable natural read order for binary files
 | NATURALWRITEORDER | F | Enable natural write order for binary files
HParm | USEPOWER | | Use power not magnitude in fbank analysis
 | NUMCEPS | | Number of cepstral parameters
 | CEPLIFTER | | Cepstral liftering coefficient
 | ENORMALISE | T | Normalise log energy
 | ESCALE | 0.1 | Scale log energy
 | SILFLOOR | 50.0 | Energy silence floor in dBs
 | DELTAWINDOW | 2 | Delta window size
 | ACCWINDOW | 2 | Acceleration window size
 | VQTABLE | NULL | Name of VQ table
 | SIMPLEDIFFS | F | Use simple differences for delta calculations
 | RAWENERGY | T | Use raw energy
 | AUDIOSIG | 0 | Audio signal number for remote control
 | USESILDET | F | Enable speech/silence detector
 | MEASURESIL | T | Measure background silence level
 | OUTSILWARN | T | Print a warning message to stdout before measuring audio levels
 | SPEECHTHRESH | 9.0 | Threshold for speech above silence level (in dB)
 | SILENERGY | 0.0 | Average background noise level (in dB) - will normally be measured rather than supplied in configuration
 | SPCSEQCOUNT | 10 | Window over which speech/silence decision reached
 | SPCGLCHCOUNT | 0 | Maximum number of frames marked as silence in window which is classified as speech whilst expecting start of speech
 | SILSEQCOUNT | 100 | Number of frames classified as silence needed to mark end of utterance
 | SILGLCHCOUNT | 2 | Maximum number of frames marked as silence in window which is classified as speech whilst expecting silence
 | SILMARGIN | 40 | Number of extra frames included before and after start and end of speech marks from the speech/silence detector
HLabel | STRIPTRIPHONES | F | Enable triphone stripping
 | TRANSALT | 0 | Filter all but specified label alternative
 | TRANSLEV | 0 | Filter all but specified label level
 | LABELSQUOTE | NULL | Select method for quoting in label files
 | SOURCELABEL | HTK | Source label format
 | TARGETLABEL | HTK | Target label format
HMem | PROTECTSTAKS | F | Enable stack protection
HModel | CHKHMMDEFS | T | Check consistency of HMM defs
 | SAVEBINARY | F | Save HMM defs in binary format
 | KEEPDISTINCT | F | Keep orphan HMMs in distinct files
 | SAVEGLOBOPTS | T | Save ~o with HMM defs
 | SAVEREGTREE | F | Save ~r macros with HMM defs
 | SAVEBASECLASS | F | Save ~b macros with HMM defs
 | SAVEINPUTXFORM | T | Save ~i macros with HMM defs
 | ORPHANMACFILE | NULL | Last resort file for new macros
 | HMMSETKIND | NULL | Kind of HMM Set
 | ALLOWOTHERHMMS | | Allow MMFs to contain HMM definitions which are not listed in the HMM List
 | DISCRETELZERO | | Map DLOGZERO to LZERO in output probability calculations
HNet | FORCECXTEXP | | Force triphone context expansion to get model names (is overridden by ALLOWCXTEXP)
 | FORCELEFTBI | F | Force left biphone context expansion to get model names ie. don't try triphone names
 | FORCERIGHTBI | F | Force right biphone context expansion to get model names ie. don't try triphone names
 | ALLOWCXTEXP | T | Allow context expansion to get model names
 | ALLOWXWRDEXP | F | Allow context expansion across words
 | FACTORLM | F | Factor language model likelihoods throughout words rather than applying all at transition into word. This can increase accuracy when pruning is tight and language model likelihoods are relatively high.
 | CFWORDBOUNDARY | T | In word-internal triphone systems, context-free phones will be treated as word boundaries
HRec | FORCEOUT | F | Forces the most likely partial hypothesis to be used as the recognition result even when no token reaches the end of the network by the last frame of the utterance
HShell | ABORTONERR | F | Causes HError to abort rather than exit
 | NONUMESCAPES | F | Prevent writing in \012 format
 | MAXTRYOPEN | 1 | Maximum number of attempts which will be made to open the same file
 | EXTENDFILENAMES | T | Support for extended filenames
HTrain | MAXCLUSTITER | 10 | Maximum number of cluster iterations
 | MINCLUSTSIZE | 3 | Minimum number of elements in any one cluster




 | BINARYACCFORMAT | T | Save accumulator files in binary format
HFB | HSKIPSTART | -1 | Start of skip over region (debugging only)
 | HSKIPEND | -1 | End of skip over region (debugging only)
HFBLat | MINFORPROB | 10.0 | Mixture pruning threshold
 | PROBSCALE | 1.0 | Scaling factor for the state acoustic and language model probabilities
 | LANGPROBSCALE | 1.0 | Additional scaling factor for language model probabilities
 | LATPROBSCALE | 1.0 | Scaling factor for the lattice-arc and language model probabilities
 | PHNINSPEN | 0.0 | Insertion penalty for each phone
 | NOSILENCE | F | Ignore silence from reference transcription when using non-exact MPE
 | QUINPHONE | F | Support quinphone model. Only available if compiled with the SUPPORT_QUINPHONE directive
 | EXACTCORRECTNESS | F | Do exact version of MPE/MWE
 | PHONEMEE | T | Set to TRUE for MPE or MWE
 | CALCASERROR | F |
 | MWE | | Set to TRUE for MWE training
 | MEECONTEXT | | Use context when calculating accuracies
 | USECONTEXT | | Same as MEECONTEXT
 | INSCORRECTNESS | | Correctness of an inserted phone
 | PDE | | Use partial distance elimination
HAdapt | USEBIAS | F | Specify a bias with linear transforms
 | SPLITTHRESH | 1000.0 | Minimum occupancy to generate a transform
 | TRANSKIND | MLLRMEAN | Transformation kind
 | ADAPTKIND | BASE | Use regression tree or base classes
 | BLOCKSIZE | full | Block structure of transform
 | BASECLASS | global | Macroname of baseclass
 | REGTREE | | Macroname of regression tree
 | MAXXFORMITER | 10 | Maximum iterations for iteratively estimated transforms
 | MLLRDIAGCOV | F | Generate a diagonal variance transform with MLLR mean transform
 | SAVESPKRMODELS | F | Store the adapted model set in addition to the transforms
 | KEEPXFORMDISTINCT | T | Save transforms in separate files rather than a TMF
 | MAXSEMITIEDITER | 10 | Maximum iterations of model/transform updates for semitied systems
 | SEMITIEDMACRO | SEMITIED | Macroname for the semitied transform
 | INITNUISANCEFR | T | Initialise nuisance dimensions using Fisher ratios
 | NUMNUISANCEDIM | 0 | Number of dimensions to remove using HLDA
HMap | MAPTAU | 10 | τ for use with MAP estimation
 | MINEGS | 0 | Minimum observations to update state
 | MINVAR | 0 | Minimum variance floor value
 | MIXWEIGHTFLOOR | 0 | MINMIX times this value is the prior floor
HShell | HWAVEFILTER | | Filter for waveform file input
 | HPARMFILTER | | Filter for parameter file input


HShell | HLANGMODFILTER | | Filter for language model file input
 | HMMLISTFILTER | | Filter for HMM list file input
 | HMMDEFFILTER | | Filter for HMM definition file input
 | HLABELFILTER | | Filter for Label file input
 | HNETFILTER | | Filter for Network file input
 | HDICTFILTER | | Filter for Dictionary file input
 | LGRAMFILTER | | Filter for gram file input
 | LWMAPFILTER | | Filter for word map file input
 | LCMAPFILTER | | Filter for class map file input
 | LMTEXTFILTER | | Filter for text file input
 | HWAVEOFILTER | | Filter for waveform file output
 | HPARMOFILTER | | Filter for parameter file output
 | HLANGMODOFILTER | | Filter for language model file output
 | HMMLISTOFILTER | | Filter for HMM list file output
 | HMMDEFOFILTER | | Filter for HMM definition file output
 | HLABELOFILTER | | Filter for Label file output
 | HNETOFILTER | | Filter for Network file output
 | HDICTOFILTER | | Filter for Dictionary file output
 | LGRAMOFILTER | | Filter for gram file output
 | LWMAPOFILTER | | Filter for word map file output
 | LCMAPOFILTER | | Filter for class map file output
LModel | RAWMITFORMAT | | Disable HTK escaping for LM tools
 | USEINTID | | Use 4 byte ID fields to save binary models
LWMap | INWMAPRAW | | Disable HTK escaping for input word lists and maps
 | OUTWMAPRAW | F | Disable HTK escaping for output word lists and maps
 | STARTWORD | <s> | Set sentence start symbol
 | ENDWORD | </s> | Set sentence end symbol
LCMap | INCMAPRAW | F | Disable HTK escaping for input class lists and maps
 | OUTCMAPRAW | F | Disable HTK escaping for output class lists and maps
 | UNKNOWNNAME | !!UNK | Set OOV class symbol
 | UNKNOWNID | 1 | Set unknown symbol class ID
LPCalc | UNIFLOOR | 1 | Unigram floor count
 | KRANGE | 7 | Good-Turing discounting range
 | nG_CUTOFF | 1 | n-gram cutoff (eg. 2G_CUTOFF)
 | DCTYPE | TG | Discounting type (TG for Turing-Good or ABS for Absolute)
LGBase | CHECKORDER | F | Check N-gram ordering in files
HLVLM | RAWMITFORMAT | F | Disable HTK escaping for LM tools
HLVRec | MAXLMLA | off | Maximum jump in LM lookahead per model
 | BUILDLATSENTEND | F | Build lattice from single token in the SENTEND node
 | FORCELATOUT | T | Always output lattice, even when no token survived
 | GCFREQ | 100 | Garbage collection period, unit is frame.

Table 18.2: Tool Specific Configuration Variables

Module | Name | Default | Description


HCompV | UPDATEMEANS | F | Update means
 | SAVEBINARY | F | Load/Save in binary format
 | MINVARFLOOR | 0.0 | Minimum variance floor
HCopy | NSTREAMS | 1 | Number of streams
 | SAVEASVQ | F | Save only the VQ indices
 | SOURCEFORMAT | HTK | File format of source
 | TARGETFORMAT | HTK | File format of target
 | SOURCEKIND | ANON | Parameter kind of source
 | TARGETKIND | ANON | Parameter kind of target
HERest | SAVEBINARY | F | Load/Save in binary format
 | BINARYACCFORMAT | T | Load/Save accumulators in binary format
 | ALIGNMODELMMF | | MMF file for alignment (2-model reest)
 | ALIGNHMMLIST | | Model list for alignment (2-model reest)
 | ALIGNMODELDIR | | Dir containing HMMs for alignment (2-model reest)
 | ALIGNMODELEXT | | Ext to be used with above Dir (2-model reest)
 | ALIGNXFORMEXT | | Input transform ext to be used with 2-model reest
 | ALIGNXFORMDIR | | Input transform dir to be used with 2-model reest
 | INXFORMMASK | | Input transform mask (default output transform mask)
 | PAXFORMMASK | | Parent transform mask (default output parent mask)
 | UPDATEMODE | | -p 0 choose mode: UPDATE update models (default), DUMP dump sum of accumulators, BOTH do both
HHEd | TREEMERGE | T | After tree splitting, merge leaves
 | TIEDMIXNAME | TM | Tied mixture base name
 | APPLYVFLOOR | T | Apply variance floor to model set
 | USELEAFSTATS | T | Use stats to obtain tied state pdf's
 | MMFIDMASK | * | Used with RC HHEd command
 | VARFLOORPERCENTILE | 0 | Maximum number of Gaussian components (as the percentage of the total Gaussian components in the system) to undergo variance floor
HMMIRest | C | 1.0 | C value for both component weights and transition probabilities update
 | CW | 1.0 | C value for component weights update
 | CT | 1.0 | C value for transition probabilities update
 | MINOCC | 10 | Minimum occupancy counts for Gaussian means and variances
 | MINOCCTRANS | 10 | Minimum occupancy counts for transition probabilities
 | MINOCCWEIGHTS | 10 | Minimum occupancy counts for component weights
 | SAVEBINARY | F | Save HMM models in binary format
 | E | 2.0 | Scaling factor for the denominator counts to determine D-smoothing constant value




HMMIRest | DFACTOROCC | 2.0 | Scaling factor for the D-smoothing constant value required to yield positive variances
 | HCRIT | 1.0 | Scaling factor for the denominator statistics
 | MPE | F | Use MPE criterion
 | MWE | F | Use MWE criterion
 | MEE | F | Use MWE criterion if MWE is set to TRUE for HFBLat. Otherwise, use MPE criterion
 | MLE | F | Use MLE criterion
 | MMIPRIOR | F | Use dynamic MMI prior
 | MMITAUI | 0.0 | I-smoothing constant value for MMI prior
 | ISMOOTHTAU | 0.0 | I-smoothing constant value for Gaussian means and variances
 | ICRITOCC | 0.0 | Same as ISMOOTHTAU
 | ISMOOTHTAUT | 0.0 | I-smoothing constant value for transition probabilities
 | ISMOOTHTAUW | | I-smoothing constant value for component weights
 | PRIORTAU | | Prior smoothing constant value for Gaussian means and variances
 | PRIORTAUW | 0.0 | Prior smoothing constant value for component weights
 | PRIORTAUT | 0.0 | Prior smoothing constant value for transition probabilities
 | PRIORK | 0.0 | Weights (between 0 and 1) on the prior
 | MIXWEIGHTFLOOR | 2.0 | Component weights floor (as the number of times of MINMIX)
 | LATFILEMASK | NULL | Mask for lattice filename
 | LATMASKNUM | NULL | Mask for numerator lattice directory
 | LATMASKDEN | NULL | Mask for denominator lattice directory
 | INXFORMMASK | NULL | Speaker mask for loading input adaptation transforms
 | PAXFORMMASK | NULL | Speaker mask for loading parent adaptation transforms
 | USELLF | F | Load lattice in LLF format
 | UPDATEMODE | UPMODE_UPDATE | Update mode. Possible values are UPMODE_UPDATE, UPMODE_DUMP or UPMODE_BOTH
HParse | V1COMPAT | F | Enable compatibility with HTK V1.X
HResults | REFLEVEL | 0 | Label level to be used as reference
 | TESTLEVEL | 0 | Label level to be scored
 | STRIPCONTEXT | F | Strip triphone contexts
 | IGNORECASE | F | If enabled, converts labels to uppercase
 | NISTSCORE | F | Use NIST formatting
 | PHRASELABEL | SENT | Label for phrase level statistics
 | PHONELABEL | WORD | Label for word level statistics
 | SPEAKERMASK | NULL | If set then report on a per speaker basis
HVite | RECOUTPREFIX | NULL | Prefix for direct audio output name
 | RECOUTSUFFIX | NULL | Suffix for direct audio output name
 | SAVEBINARY | F | Save transforms as binary
HLStats | DISCOUNT | 0.5 | Discount constant for backoff bigrams
HList | AUDIOSIG | 0 | Audio signal number for remote control
 | SOURCERATE | 0.0 | Sample rate of source in 100ns units
 | TRACE | 0 | Trace setting
HDecode | USEHMODEL | F | Use adaptation data structure and likelihood calculation routine of HModel
 | STARTWORD | <s> | Word used as the start of network
 | ENDWORD | </s> | Word used as the end of network
 | FASTLMLABEAM | off | Fast language model look ahead beam
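
Configuration variables such as those tabulated in this chapter are normally supplied to a tool in a configuration file. The following is a minimal Python sketch and not part of HTK. It assumes the usual `[MODULE:] NAME = VALUE` line syntax with `#` comments, which is described elsewhere in this book rather than in this chapter; the helper names and the example settings are hypothetical. It also shows how a value given in 100ns units, such as WINDOWSIZE, translates into milliseconds.

```python
import re

# Assumed line syntax: optional "MODULE:" prefix, then NAME = VALUE.
_SETTING = re.compile(
    r"^(?:(?P<module>\w+)\s*:\s*)?(?P<name>\w+)\s*=\s*(?P<value>.+)$"
)


def parse_value(text):
    """Interpret T/F as booleans and numbers as int or float; else keep the string."""
    if text in ("T", "TRUE"):
        return True
    if text in ("F", "FALSE"):
        return False
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            pass
    return text


def parse_config(text):
    """Return {(module or None, NAME): value} for each setting in the text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        match = _SETTING.match(line)
        if match is None:
            raise ValueError("unrecognised configuration line: " + raw)
        module = match.group("module")
        key = (module.upper() if module else None, match.group("name").upper())
        settings[key] = parse_value(match.group("value").strip())
    return settings


def units_to_ms(units):
    """Convert a duration in 100ns units to milliseconds (10000 units = 1 ms)."""
    return units / 10000.0


# Hypothetical front-end settings, for illustration only.
example = """
# hypothetical front-end settings
TARGETKIND = MFCC_E
WINDOWSIZE = 256000.0
USEHAMMING = T
HPARM: TRACE = 1
"""
config = parse_config(example)
window_ms = units_to_ms(config[(None, "WINDOWSIZE")])
```

With the default WINDOWSIZE of 256000.0 the analysis window comes out at 25.6 ms. Keying the dictionary on the module prefix keeps a module-specific setting such as `HPARM: TRACE` separate from a global one with the same name.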




Chapter 19 

Error and Warning Codes

When a problem occurs in any HTK tool, either error or warning messages are printed.

If a warning occurs then a message is sent to standard output and execution continues. The
format of this warning message is as follows:

WARNING [-nnnn] Function: 'Brief Explanation' in HTool

The message consists of four parts. On the first line is the tool name and the error number.
Positive error numbers are fatal, whilst negative numbers are warnings and allow execution to
continue. On the second line is the function in which the problem occurred and a brief textual
explanation. The reason for sending warnings to standard output is so that they are synchronised
with any trace output.
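
As an illustration only, messages in this layout can be picked apart by a short script, for instance when scanning log files from a long training run. The following is a minimal Python sketch and not part of HTK: the names `HTKMessage`, `parse_message` and `is_fatal` are hypothetical, and the regular expression assumes the single-line layout used in the examples of this chapter.

```python
import re
from typing import NamedTuple, Optional


class HTKMessage(NamedTuple):
    severity: str   # "WARNING" or "ERROR"
    code: int       # signed error number, e.g. -1089 or 2321
    function: str   # function in which the problem occurred
    text: str       # remainder of the line (explanation, optional tool name)


# Matches e.g.  WARNING [-nnnn] Function: 'Brief Explanation' in HTool
_MESSAGE = re.compile(r"^\s*(WARNING|ERROR)\s*\[([+-]?\d+)\]\s+(\w+):\s*(.*?)\s*$")


def parse_message(line: str) -> Optional[HTKMessage]:
    """Return the parsed message, or None if the line is not in this layout."""
    match = _MESSAGE.match(line)
    if match is None:
        return None
    severity, code, function, text = match.groups()
    return HTKMessage(severity, int(code), function, text)


def is_fatal(message: HTKMessage) -> bool:
    """Positive error numbers are fatal; negative ones are warnings."""
    return message.code > 0
```

Lines that do not carry a bracketed error number, such as the closing `FATAL ERROR - Terminating program` line, are returned as `None` so that a caller can handle them separately.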

If an error occurs a number of error messages may be produced on standard error. Many of
the functions in the HTK Library do not exit immediately when an error condition occurs, but
instead print a message and return a failure value back to their calling function. This process may
be repeated several times. When the HTK Tool that called the function receives the failure value,
it exits the program with a fatal error message. Thus the displayed output has a typical format as
follows:

ERROR [+nnnn] FunctionA: 'Brief explanation'
ERROR [+nnnn] FunctionB: 'Brief explanation'
ERROR [+nnnn] FunctionC: 'Brief explanation'
FATAL ERROR - Terminating program HTool

Error numbers in HTK are allocated on a module by module and tool by tool basis in blocks of
100 as shown by the table below. Within each block of 100 numbers the first 20 (0 - 19)
and the final 10 (90-99) are reserved for standard types of error which are common to all tools and
library modules.

All other codes are module or tool specific.
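
The allocation rule just described can be expressed in a few lines. The following is a minimal Python sketch and not part of HTK; the function name `describe_code` and the layout of its result are hypothetical.

```python
def describe_code(code: int) -> dict:
    """Split a signed HTK error number into its block and its offset within the block."""
    number = abs(code)
    block_start = (number // 100) * 100   # blocks are allocated in hundreds
    offset = number % 100
    return {
        "fatal": code > 0,                          # negative numbers are warnings
        "block": (block_start, block_start + 99),   # owning tool or module block
        "offset": offset,
        # offsets 0-19 and 90-99 are reserved for errors common to all tools
        "generic": offset < 20 or offset >= 90,
    }
```

For example, `describe_code(5010)` reports an error at offset 10 of the block starting at 5000, inside the reserved range, whereas `describe_code(2328)` reports offset 28 of the block starting at 2300, which is specific to the owning tool.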

19.1 Generic Errors



+??00 Initialisation failed

The initialisation procedure for the tool produced an error. This could be due to errors in the
command line arguments or configuration file.

+??01 Facility not implemented

HTK does not support the operation requested.

+??05 Available memory exhausted

The operation requires more memory than is available.

+??06 Audio not available

The audio device is not available, either there is no driver for the current machine, the library
was compiled with NO_AUDIO set or another process has exclusive access to the audio device.

HCopy | 1000-1099
HList | 1100-1199
HLEd | 1200-1299
HLStats | 1300-1399
HDMan | 1400-1499
HSLab | 1500-1599
HCompV | 2000-2099
HInit | 2100-2199
HRest | 2200-2299
HERest | 2300-2399
HSmooth | 2400-2499
HQuant | 2500-2599
HHEd | 2600-2699
HBuild | 3000-3099
HParse | 3100-3199
HVite | 3200-3299
HResults | 3300-3399
HSGen | 3400-3499
HLRescore | 4000-4100

HShell | 5000-5099
HMem | 5100-5199
HMath | 5200-5299
HSigP | 5300-5399
HAudio | 6000-6099
HVQ | 6100-6199
HWave | 6200-6299
HParm | 6300-6399
HLabel | 6500-6599
HGraf | 6800-6899
HModel | 7000-7099
HTrain | 7100-7199
HUtil | 7200-7299
HFB | 7300-7399
HAdapt | 7400-7499
HDict | 8000-8099
HLM | 8100-8199
HNet | 8200-8299
HRec | 8500-8599
HLat | 8600-8699

LCMap | 15000-15099
LWMap | 15100-15199
LUtil | 15200-15299
LGBase | 15300-15399
LModel | 15400-15499
LPCalc | 15500-15599
LPMerge | 15600-15699
LAdapt | 16400-16499
LPlex | 16600-16699
HLMCopy | 16900-16999
Cluster | 17000-17099
LLink | 17100-17199
LNewMap | 17200-17299

+??10 Cannot open file for reading

Specified file could not be opened for reading. The file may not exist or the filter through
which it is read may not be set correctly.

+??11 Cannot open file for writing

Specified file could not be opened for writing. The directory may not exist or be writable by
the user or the filter through which the file is written may not be set correctly.

+??13 Cannot read from file

Cannot read data from file. The file may have been truncated, incorrectly formatted or the
filter process may have died.

+??14 Cannot write to file

Cannot write data to file. The file system is full or the filter process has died.

+??15 Required function parameter not set

You have called a library routine without setting one of the arguments.

+??16 Memory heap of incorrect type

Some library routines require you to pass them a heap of a particular type.

+??19 Command line syntax error

The command line is badly formed, refer to the manual or the command summary printed
when the command is executed without arguments.

+??9? Sanity check failed

Several functions perform checks that structures are self consistent and that everything is
functioning correctly. When these sanity checks fail they indicate the code is not functioning
as intended. These errors should not occur and are not correctable by the user.
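
Because the generic errors are written with `??` standing for the block prefix, a concrete number can be matched against them by looking only at its last two digits. The following is a minimal Python sketch and not part of HTK; the table contents are copied from the list above, while the names `GENERIC_ERRORS` and `generic_meaning` are hypothetical.

```python
# Offsets within a block of 100, taken from the generic error list above.
GENERIC_ERRORS = {
    0: "Initialisation failed",
    1: "Facility not implemented",
    5: "Available memory exhausted",
    6: "Audio not available",
    10: "Cannot open file for reading",
    11: "Cannot open file for writing",
    13: "Cannot read from file",
    14: "Cannot write to file",
    15: "Required function parameter not set",
    16: "Memory heap of incorrect type",
    19: "Command line syntax error",
}


def generic_meaning(code):
    """Return the generic description for an error number, or None if there is none."""
    offset = abs(code) % 100
    if 90 <= offset <= 99:          # the +??9? entries
        return "Sanity check failed"
    return GENERIC_ERRORS.get(offset)
```

A result of `None` means the number is not one of the generic errors listed here, and the summary for the owning tool or module should be consulted instead.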



19.2 Summary of Errors by Tool and Module

HCopy

+1030 Non-existent part of file specified
HCopy needed to access a non-existent part of the input file. Check that the times are
specified correctly, that the label file contains enough labels and that it corresponds to
the data file.

±1031 Label file formatted incorrectly
HCopy is only able to properly copy label files with the same number of levels/alternatives.
When using labels with multiple alternatives only the first one is used to determine
segment boundaries.

+1032 Appending files of different type/size/rate
Files that are joined together must have the same parameter kind and sample rate.

-1089 ALIEN format set
Input/output format has been set to ALIEN, ensure that this was intended.

HList

HLEd

+1230 Edit script syntax error
The HLEd command script contains a syntax error, check the input script against the
descriptions of each command in section 17.10 or obtained by running HLEd -Q.

±1231 Operation invalid
You have either exceeded HLEd limits on the number of boundaries that can be specified,
tried to perform an operation on a non-existent level or tried to sort an auxiliary level
into time order. None of these operations are supported.

+1232 Cannot find pronunciation
The dictionary does not contain a valid pronunciation (only occurs when attempting
expansion from a dictionary).

-1289 ALIEN format set
Input/output format has been set to ALIEN, ensure that this was intended.

HLStats

+1328 Load/Make HMMSet failed
The model set could not be loaded due to either an error opening the file or the data
within being inconsistent.

±1330 No operation specified
You have invoked HLStats but have not specified an operation to be performed.

-1389 ALIEN format set
Input format has been set to ALIEN, ensure that this was intended.

HDMan

±1430 Limit exceeded
HDMan has several built in limits on the number of different pronunciations, phones,
contexts and command arguments. This error occurs when you try to exceed one of
them.

±1431 Item not found
Could not find item for deletion. Check that it actually occurs in the dictionary.

±1450 Edit script file syntax error
The HDMan command script contains a syntax error, check the input script against the
descriptions of each command in section 17.5 or obtained by running HDMan -Q.

±1451 Dictionary file syntax error
One of the input dictionaries contained a syntax error. Ensure that it is in a HTK
readable form (see section 12.7).



19.2 Summary of Errors by Tool and Module 



331 



±1452 

HSLab 
-1589 

HCompV 
+2020 

+2021 

+2028 

+2030 
+2039 
+2050 

-2089 

HInit 
+2120 

+2121 

+2122 

+2123 

+2124 

+2125 

+2126 
+2127 
+2128 
+2129 



Word out of order in dictionary error 

Entries in the dictionary must be sorted into alphabetical (ASCII) order. 



ALIEN format set 

Input/output format has been set to ALIEN, ensure that this was intended. 



HMM doe^ipt appear in HMMSet 

Supplied HI^M filename does not appear in HMMSet. Check correspondence between 
HMM filenai^nd HMMSet. 



Not enough data to calculate variance 

There are not enough frames of data to evaluate a reliable estimate of variance. Use 
more data. ^ > 

Load/Make HMMSepailed 

The model set could ^ be loaded due to either an error opening the file or the data 
within being inconsisteirO 

Needs continuous models ^ 

HCompV can only operate ^models with an HMM set kind of PLAINHS or SHAREDHS. 
Speaker pattern matching faiitire 

The specified speaker pattern (?^u^ not be matched against a given untterance file name. 
Data does not match HMM yA^ 

An aspect of the data does not match, the equivalent aspect in the HMMSet. Check the 
parameter kind of the data. ^ y, 

ALIEN format set C 

Input format has been set to ALIEN, en\ure that this was intended. 

% 

Unknown update fiag Q ^ 

Unknown fiag set by -u option, use combinatio^^ of tmvw. 

Too little data , 

Not enough data to reliably estimate parameters. more training data. 
Segment with fewer frames than model states 

Segment may be too short to be matched to model, do use this segment for training. 

Cannot mix covariance kind in a single mix /-\ 

Covariance kind of all mixture components in any one swSe_must be the same. 

Bad covariance kind 

Covariance kind of mixture component must be either FULECNer DIAGC. 

No best mix found ^^-^v- 
The Viterbi mixture component allocation failed to find a mostvikely component with 
this data. Check that data is not corrupt and that parameter vaiues produced by the 
initial uniform segmentation are reasonable. 

No path through segment (~) 

The Viterbi segmentation failed to find a path through model with this Check that 

data is not corrupt and that a valid path exists through the model. ^ 

Zero occurrence count 

Parameter has had no data assigned to it and cannot be updated. Ensure that each 
parameter can be estimated by using more training data or fewer parameters. 

Load/Make HMMSet failed 

The model set could not be loaded due to either an error opening the file or the data 
within being inconsistent. 

HMM not found 

HMM missing from HMMSet. Check that the HMMSet is complete and has not been 
corrupted. 
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+2150 
+2170 

-2189 

HRest 
+2220 

+2221 
+2222 
+2223 
-2225 

+2226 
+2228 
+2250 

-2289 

HERest 
+2320 

+2321 
-2326 
+2327 
+2328 



Data does not match HMM 

An aspect of the data does not match the equivalent aspect in the HMMSet. Check the 
parameter kind of the data. 

Index out of range 

Trying to access a mixture component or VQ index beyond the range in the current 
HMM. 

ALIEN format set 

Input format has been set to ALIEN, ensure that this was intended. 
Unknown updi^flag 

Unknown flag set by -u option, use combinations of tmvw. 
Too few training ^^^ples 

There are fewer traopi^g examples than the minimum set by the -m option (default 3). 
Either reduce the vakie specified by -m or use more training examples. 

Zero occurrence count^^^^ 

Parameter has had no daxa assigned to it and cannot be updated. Ensure that each 
parameter can be estimat^^by using more training data or fewer parameters. 

Floor too high ^ j 

Mix weight floor has been set''sp)high that the sum over all mixture components exceeds 
unity. Reduce the floor value. 

Defunct Mix X.Y.Z 

Not enough training data to re-esti^^te the covariance vector of mixture component Z 
in stream Y of state X. The weight oP^he mixture component is set to 0.0 and it will 
never recover even with further training. , , 

No training data ^ i( 

None of the supplied training data couldN3e>used to re-estimate the model. Data may 



be corrupt or has been floored. 



6 



Load/Make HMMSet failed 
The model set could not be loaded due to eitl(^ an error opening the file or the data 
within being inconsistent. , 

Data does not match HMM 

An aspect of the data does not match the equivalent\^8^ect in the HMMSet. Check the 
parameter kind of the data. ^-^ 

ALIEN format set 



Input format has been set to ALIEN, ensure that this waS^^^nded. 

o 



Unknown update flag 

Unknown flag set by -u option, use combinations of tmvw. 

Load/Make HMMSet failed 
The model set could not be loaded due to either an error opening file or the data 
within being inconsistent. 

No transitions <^ 
No transition out of an emitting state, ensure that there is a transition path from begin- 
ning to end of model. 

Floor too high 

Mix weight fioor has been set so high that the sum over all mixture components exceeds 
unity. Reduce the floor value. 

No mixtures above floor 

None of the mixture component weights are greater than the floor value, reduce the floor 
value. 
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-2330 
-2331 

-2389 

HSmooth 
+2420 

+2421 

+2422 

+2423 

-2424 

+2425 

-2427 

-2428 

+2429 

HQUANT 

+2530 
+2531 

HHEd 

+2628 

±2630 
-2631 



Zero occurrence count 

Parameter has had no data assigned to it and cannot be updated. Ensure that each 
parameter can be estimated by using more training data or fewer parameters. 

Not enough training examples 

Model was not updated as there were not enough training examples. Either reduce the 
minimum specified by -m or use more data. 

ALIEN format set 

Input format has been set to ALIEN, ensure that this was intended. 
Unknown updi^flag 

Unknown flag set by -u option, use combinations of tmvw. 



nly J* 



HSmooth can onljpa^ used if HMM set kind is either DISCRETE or TIED. 
Too many monophones.\n list 

HSmooth is limited tS^JMMSets containing fewer than 500 monophones. 
Different number of statS&rfor smoothing 

Monophones and context-oenendent models have differing numbers of states. 
No transitions 

No transition out of an emittiiig stete, ensure that there is a transition path from begin- 
ning to end of model. ^\ 

Floor too high 

Mix weight floor has been set so hi^|^hat the sum over all mixture components exceeds 
unity. Reduce the floor value. 

Zero occurrence count 

Parameter has had no data assigned to ^<^nd cannot be updated. Ensure that each 
parameter can be estimated by using more gaining data or fewer parameters 

Not enough training examples

Model was not updated as there were not enough training examples. Either reduce the
minimum specified by -m or use more data.

Load/Make HMMSet failed

The model set could not be loaded due to either an error opening the file or the data
within being inconsistent.

Stream widths invalid 

The chosen stream widths are invalid. Check that these match the parameter kind and
are specified correctly.

Data does not match codebook 

Ensure that the parameter kind of the data matches that of the codebook being gener- 
ated. 

Load/Make HMMSet failed

The model set could not be loaded due to either an error opening the file or the data
within being inconsistent.

Tying null or different sized items 

You have executed a tie command on items which do not have the appropriate structure 
or the structures are not matched. Ensure that the item list refers only to the items that 
you wish to tie together. 

Performing operation on no items 

The item list was empty, no operation is performed. 



19.2 Summary of Errors by Tool and Module 






+2632 

+2634 

+2635 
-2637 
-2638 

-2639 
+2640 
+2641 

+2650 

+2651 

±2655 

+2660 

+2661 

+2662 

±2663 

HBuild

±3030 
±3031 



Command parameter invalid 

The parameters to the command are invalid either because they refer to parts of the
model that do not exist (for instance a state that does not appear in the model) or 
because they do not represent an acceptable value (for instance HMMSet kind is not 
PLAINHS, SHAREDHS, TIEDHS or DISCRETEHS). 

Join parameters invalid or not set 

Make sure that the join parameters (set by the JO command) are reasonable. In
particular take care that the floor is low enough to ensure that when summed over all the
mixture components the sum is below 1.0.
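
The floor condition can be checked by hand before an edit script is run. The following is a minimal sketch only: the function names are hypothetical and are not part of HHEd, and it assumes the floor is the same absolute value for every mixture component. It tests that holding each of the components at the floor still leaves the total below 1.0.

```python
def floor_is_feasible(num_mixes: int, floor: float) -> bool:
    """Return True if num_mixes weights, each held at `floor`, sum to less than 1.0.

    Hypothetical helper for illustration; not an HTK function.
    """
    if num_mixes < 1 or floor < 0.0:
        raise ValueError("need at least one component and a non-negative floor")
    return num_mixes * floor < 1.0


def max_feasible_floor(num_mixes: int) -> float:
    """Exclusive upper bound on a uniform floor for the given number of components."""
    if num_mixes < 1:
        raise ValueError("need at least one component")
    return 1.0 / num_mixes
```

For example, with 4 components a floor of 0.01 passes the check, whereas a floor of 0.25 does not, because the floored weights alone would already sum to 1.0.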

Cannot find matching item

Search for specified item was unsuccessful. When this occurs with the CL or MT commands
ensure that the appropriate monophone/biphone models are in the current HMMSet.

Small gConst

A small gConst indicates a very low variance in that particular Gaussian. This could be
indicative of over-training of the models.

No typical state

When tying states together a search is performed for the distribution with largest variance
and all tied states share this distribution. If this cannot be found the first in the list will
be used instead.

Long macro name

In general macro names should not exceed 20 characters in length.

Not implemented

You have asked HHEd to perform a function that is not implemented.

Invalid stream split

The specified number/width of the streams does not agree with the parameter kind/vector
size of the models.

Edit script syntax error 

The HHEd command script contains a syntax error, check the input script against the
descriptions of each command in section 17.8 or obtained by running HHEd -Q.

Command range error

The value specified in the command script is out of range. Ensure that the specified
state exists and that the value given is valid.

Stats file load error

Either loading occupation statistics for the second time or executing an operation that
needs the statistics loaded without loading them.

Trees file syntax error

The trees file format did not correspond to that expected. Ensure that the file is complete
and has not been corrupted.

Trees file macro/question not recognised

The question or macro referred to does not exist. Ensure that the file is complete and
has not been corrupted.

Trying to synthesize for unknown model

There is no tree or prototype model for the new context. Ensure that a tree has been
constructed for the base phone.

Invalid types to tree cluster

Tree clustering will only work for single Gaussian diagonal covariance untied models of
similar topology.



Mismatch between command line and language model 

Ensure that the !ENTER and !EXIT words are correctly defined and that the supplied
files are of the appropriate type. 

Unknown word 

Ensure that the word list corresponds to the language model/lattice supplied. 
HParse
±3130 

±3131 

±3132 

±3150 

HVite

±3228 

±3230 
±3231 

±3232 
±3233 

-3289 

HResults 
-3330 

±3331 

±3332 
±3333 

-3389 

HSGen 
-3420 

HLRescore 
Variable not defined

You have referenced a network that has not yet been defined. Check that all networks
are defined before they are referenced.

Loop or word expansion error

There is either a mismatch between the WD_BEGIN WD_END pairs or a triphone loop is
badly formed.

Dictionary error

When generating a dictionary a word exceeded the maximum number of phones, a word
occurred twice or no dictionary was produced.

Syntax error in HParse file

The HParse network definition contains a syntax error, check the input file against the
network description in section 17.16.

Load/Make HMMSet failed

The model set could not be loaded due to either an error opening the file or the data
within being inconsistent.

Unsupported operation

HVite is not able to perform the operation requested.

Data does not match HMMs

There is a mismatch between the data file and the HMMSet. Ensure that the data is
parameterised in the correct format and the configuration parameters match those used
during training.

MMF Load Error

The HMMSet does not contain a well-formed regression class tree.

Transcription empty

In alignment mode a segment had an empty transcription and no boundary word was
specified.

ALIEN format set

Input/output format has been set to ALIEN, ensure that this was intended.

Empty file

The file was empty and will be skipped.

Unknown label

The label did not appear in the list supplied to HResults. This error will only occur if
calculating confusion matrices so normally the contents of the word list file will have no
effect on results.

Too many labels

HResults will only generate confusion statistics for a small number of labels.

Cannot calculate word spot results

When calculating word spotting results the label files need to have both times and scores
present.

ALIEN format set 

Input format has been set to ALIEN, ensure that this was intended. 



Network malformed 

The word network is malformed. The information in a node (word and following arcs) 
is set incorrectly. 
HMem
+5170 

+5171 
+5172 
+5173 
+5174 
+5175 



-4089 ALIEN format set 

Input/output format has been set to ALIEN, ensure that this was intended. 

HShell 

+5020 Command line processing error 

+5021 Command line argument type error 

+5022 Command line argument range error 

The command line is badly formed. Ensure that it matches the syntax and values
expected by the command (check the manual page or the syntax obtained by running
HTool without any arguments).

+5050 Configuration file format error

HShell was unable to parse the configuration. Check that it is of the format described 
in section 4.3. 

+5051 Script file format error

Check that the script file is just a list of file names and that if any file names are quoted
that the quotes occur in pairs.

+5070 Module version syntax error

A module registered with HShell with an incorrectly formatted version string (which
should be of the form "!HVER!HModule: Vers.str [WHO DD/MM/YY]").

+5071 Too many configuration parameters

The size of the buffer used by one of the tools or modules to read its configuration
parameters was exceeded. Either reduce the total number of configuration parameters
in the file or make more of them specific to their particular module rather than global.

+5072 Configuration parameter of wrong type

The configuration parameter is of the wrong type. Check that its type agrees with that
shown in chapter 18.
shown in chapter 18. 



+5073 Configuration parameter out of range

The configuration parameter is out of range.



Heap parameters invalid

You have tried to create a heap with unreasonable parameters. Adjust these so that the
growth factor is positive and the initial block size is no larger than the maximum. For
MSTAK the element size should be 1.

Heap not found

The specified heap could not be found, ensure that it has not been deleted or memory
overwritten.

Heap does not support operation

The heap is of the wrong type to support the requested operation. In particular it is not
possible to Reset or Delete a CHEAP.

Wrong element size for MHEAP

You have tried to allocate an item of the wrong size from a MHEAP. All items on a MHEAP
must be of the same size.

Heap not initialised

You have tried to allocate an item on a heap that has not yet been created. Ensure that
CreateHeap is called to initialise the heap before any items are allocated from it.

Freeing unseen item 

You have tried to free an item from the wrong heap. This can occur if the wrong heap 
is specified, the item pointer has been corrupted or the item has already been freed 
implicitly by a Reset/DeleteHeap call. 



HMath 
+5220

+5270 
+5271 

HSigP 
+5320 

+5321 

-5322 

HAudio
+6020 

+6021 
+6070 
-6071 

HVQ 

+6150 

+6151 
+6170 

+6171 

+6172 
+6173 

+6174 



Singular covariance matrix 

The covariance matrix was not invertible. This may indicate a lack of training data or 
linearly dependent parameters. 

Size mismatch 

The input parameters were of incompatible sizes. 
Log of negative 

Result would be logarithm of a negative number. 

No results for WaveToLPC

Call did not include Vectors for the results.

Vector size mismatch

Input vectors were of mismatched sizes.

Clamped samples during zero mean

During a zero mean operation samples were clipped as they were outside the allowable
range.

Replay buffer not active

Attempt to access a replay buffer when one was not attached.

Cannot StartAudio without measuring silence

An attempt was made to start audio input through the silence detector without first
measuring or supplying the background silence values.

Audio frame size/rate invalid

The choice of frame period and window duration are invalid. Check both these and the
sample rate.

Setting speech threshold below silence

The thresholds used in the speech detector have been set so that the threshold for
detecting speech is set below that of detecting silence.



VQ file format error

The VQ file was incorrectly formatted. Ensure that the file is complete and has not been
corrupted.

VQ file range error

A value from the VQ file was out of range. Ensure that the file is complete and has not
been corrupted.

Magic number mismatch

The VQ magic number (normally based on parameter kind) does not match that
expected. Check that the parameter kind used to quantise the data and create the VQ
table matches the current parameter kind.

VQ table already exists

All VQ tables must have distinct names. This error will occur if you try to create or
load a VQ table with the same name as one already loaded.

Invalid covariance kind 

Entries in VQ tables must have either NULLC, FULLC or INVDIAGC covariance kind. 
Node not in table 

A node was missing from the VQ table. Ensure that the VQ table was properly created 
or that the file was complete. 

Stream codebook mismatch 

The number or size of streams in the VQ table does not match that requested. 






HWave 
+6220

+6221 
+6230 

+6250 
+6251 

+6252 

+6253 

+6254 

+6270 
+6271 

HParm 
+6320 

+6321 

+6322 

+6323 

+6324 
+6325 
+6328 

+6350 

-6351 



Cannot fseek/ftell 

Unless the wave file is read through a pipe fseek and ftell are expected to work correctly 
so that HWave can calculate the file size. If this error occurs when using an input pipe, 
supply the number of samples in the file using the configuration variable NSAMPLES. 

File appears to be infinite

HWave cannot determine the size of the file. 

Config parameter not set 

A necessary configuration parameter has not been set. Determine the correct value and
place this in the configuration file before re-invoking the tool.

Premature end of header

HWave could not read the complete file header.

Header contains invalid data

HWave was unable to successfully parse the header. The header is invalid, of the wrong
type or a variation that HWave does not handle.

Header missing essential data

The header was missing a piece of information necessary for HWave to load the file.
Check the processing of the input file and re-process if necessary.

Premature end of data

The file ended before all the data was read correctly. Check that the file is complete, has
not been corrupted and where necessary NSAMPLES is set correctly.

Data formatted incorrectly

The data could not be decoded properly. Check that the file was complete and processed
correctly.

File format invalid

The file format is not valid for the operation requested.

Attempt to read outside file

You have tried to read a sample outside the waveform file.

Configuration mismatch

The data file does not match the configuration. Check the configuration file is correct.

Invalid parameter kind

Parameter kind is not valid. Check the configuration file.

Conversion not possible

The specified conversion is not possible. Check the configuration is correct and re-code
the data from waveform files if necessary.

Audio error

An audio error has been detected. Check the HAudio configuration and the audio
device. 



Buffer not initialised 

Ensure that the buffer is used in the correct manner. 






Silence detection failed 

The silence detector was not initialised correctly before use. 

Load/Make HMMSet failed 
The model set could not be loaded due to either an error opening the file or the data 
within being inconsistent. 

CRC error 

The CRC does not match that of the data. Check the data file is complete and has not 
been corrupted. 

Byte swapping not possible 

HParm will attempt to byte swap parameter files but this may not work if the floating 
point representation of the machine that generated the file is different from that which 
is reading it. 
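
The limitation can be illustrated with a short sketch. This is not HParm code; the helper name is hypothetical. It reverses the byte order of each 4-byte value, which recovers the original number only when both machines use the same floating point representation (IEEE 754 single precision is assumed here) and differ solely in byte order.

```python
import struct


def swap_4byte_values(raw: bytes) -> bytes:
    """Reverse the byte order of every 4-byte value in `raw` (hypothetical helper)."""
    if len(raw) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4 bytes")
    return b"".join(raw[i:i + 4][::-1] for i in range(0, len(raw), 4))


# A value written by a big-endian machine ...
big_endian = struct.pack(">f", 1.5)
# ... is byte swapped and then read as little-endian.
swapped = swap_4byte_values(big_endian)
value = struct.unpack("<f", swapped)[0]
```

If the writing machine used a different floating point layout, the same swap would yield bytes that decode to an unrelated number, which is the failure the message warns about.
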
+6352

+6370 

+6371 
+6372 

+6373 

+6374 

+6375 

+6376 

HLabel 
+6520 

+6521 

+6550 
+6551 
+6552 
±6553 
+6554 

+6570 
+6571 
+6572 

HModel

+7020 
+7021 



File too short to parameterise 

The file does not contain enough data to produce a single observation. Check the file is 
complete and not corrupt. If it is, it should be discarded. 

Unknown parameter kind 

The specified parameter kind is not recognised. Refer to section 5.18 for a list of allowable 
parameter kinds and qualifiers. 

Invalid parameters for coding 

The chosen parameters are not valid for coding. Choose different ones. 

Stream widths not valid

Cannot split the data into the specified number of streams. Check that the parameter
kind is correct and matches any models used.

Buffer/observation mismatch

The observation parameter kind should match that of the input buffer. Check that the
configuration parameter kind is correct and matches that of any models used.

Buffer size too small for window

Calculation of delta parameters requires a window larger than the buffer size chosen.
Increase the size of the buffer.

Frame not in buffer

An attempt was made to access a frame that does not appear in the buffer. Make sure
that the file actually contains the specified frame.

Mean/Variance normalisation failed

The mean or variance normalisation vector from the file specified by the normalisation
dir and mask cannot be applied. Make sure the file format is correct and the vectors are
of the right dimension.

MLF index out of range 

An attempt was made to access an MLF that has not been loaded or to load too many
MLFs.

fseek/ftell not possible

HLabel needs random access to MLFs. This error is generated when this is not possible
(for instance if access is via a pipe).

HTK format error

MLF format error

TIMIT format error

ESPS format error

SCRIBE format error

A label file was formatted incorrectly. Label file formats are described in chapter 6.

Level out of range

Attempted to access a non-existent label level. Check that the correct label file has been
loaded.

Label out of range

Attempted to access a non-existent label. Check that the correct label file has been
loaded and that the correct level is being accessed.

Invalid format 

The specified file format is not valid for the particular operation. 



Cannot find physical HMM 

No physical HMM exists for a particular logical model. Check that the HMMSet was 
loaded or created correctly. 

INVDIAG internal format 

Attempts to load or save models with INVDIAG covariance kind will fail as this is a purely 
internal model format. 
±7023

+7024 

+7025 

+7030 

+7031 

±7032 
+7035 
+7036 

+7037 

+7050 
+7060 

+7070 

+7071 

HTrain 
+7120 

+7150 

+7170 
+7171 

+7172 
-7173 



varFloor should be variance floor 

HMODEL reserves the macro name varFloorN as the variance floor for stream N. These 
should be variance macros (type v) of the correct size for the particular stream. 

Variance tending to 0.0 

A variance has become too low. Start using a variance floor or increase the amount of 
training data. 

Bad covariance kind 

The particular functionality does not support the covariance kind of the mixture com- 
ponent. 

HMM set incomplete or inconsistent

The HMMSet contained missing or inconsistent data. Check that the file is complete
and has not been corrupted.

HMM parameters inconsistent

Some model parameters were inconsistent. Check that the file is complete and has not
been corrupted.

Option mismatch

All HMMs in a particular set must have consistent options.

Unknown macro

Macro does not exist. Check that the name is correct and appears in the HMMSet.

Duplicate macro

Attempted to create a macro with the same name as one already present. Choose a
different name.

Invalid macro 

Macro had invalid type. See section 7.3, which describes the allowable macro types.

Model file format error

HMM List format error

The file was formatted incorrectly. Check the file is complete and has not been corrupted.

Invalid HMM kind

Invalid HMMSet kind. Check that this is specified correctly.

Observation not compatible with HMMSet

Attempted to calculate an observation likelihood for an observation not compatible with
the HMMSet. Check that the parameter kind is set correctly.

Clustering failed

Almost certainly due to a lack of data, reduce the number of clusters requested or increase
amount of data.

Accumulator file format error

Cannot read an item from an accumulator file. Check that file is complete and not
corrupted.

Unsupported covariance kind 

Covariance kind must be either FULLC, DIAGC or INVDIAGC. 



Item out of range

Attempt made to access data beyond expected range. Check that the item number is
correct.



Tree size must be power of 2 

Requested codebook size must be a power of 2 when using tree based clustering. 
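
A requested size can be tested before clustering is attempted. The check below is a generic sketch, not taken from the HTK sources, and the function name is an assumption: a positive integer is a power of 2 exactly when it has a single bit set.

```python
def is_power_of_two(size: int) -> bool:
    """Return True if `size` is a positive power of 2 (1, 2, 4, 8, ...)."""
    # n & (n - 1) clears the lowest set bit; the result is zero only
    # when that bit was the only one set.
    return size > 0 and (size & (size - 1)) == 0
```

Under this test a codebook size of 64 is acceptable for tree based clustering while 48 is not.
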
Segment empty 

Empty data segment in file. Check that file has not become corrupted and that the start 
and end segment times are correct. 



HUtil 
+7220
+7230
+7270
+7321
±7332
HDict
HLM
±8151
HMMSet empty

A scan was initiated for a HMMSet with no members.

Item list parse error

The item list syntax was incorrect. Check the item list specification in section 17.8.

Item list type error

Each item in a particular list should be of the same type and size.

Stats file format error

Stats file is of wrong format. Note the format of the stats file has changed in HTK_V2.0
and old files will need converting to the new format.

Stats file model error

A model name encountered in the stats file is invalid. Check that the model set corresponds
to that used to generate the stats file and that the stats file is complete and has not been
corrupted.

Accessing non-existent macro

Attempt to perform operation on non-existent macro.

Member id out of range

Attempt to perform set operation on out of range member.

Unknown model

Model in HMM List not found in HMMSet, check that the correct HMM List is being
used.

Invalid output probability

Mixture component probability has not been set. This should not occur in normal use.

Beta prune failed on taper

Utterance is possibly too short for minimum duration of model sequence. Check
transcription.

No path through utterance

No path was found on the beta training pass. Relax the pruning threshold.

Empty label file

No labels found in label file, check label file.

Single-pass retraining data mismatch

Paired training files must contain the same number of observations. Use original data to
re-parameterise.

HMM with unreachable states

HMM has an unreachable state, check transition matrix.

Transition matrix with discontinuity

Check transition matrix.

Data does not match HMM

An aspect of the data does not match the equivalent aspect in the HMMSet. Check the
parameter kind of the data.

Dictionary file format error

The dictionary file is not correctly formatted. Section 12.7 describes the HTK dictionary
file format.



LM syntax error 

The language model file was formatted incorrectly. Check the file is complete and has 
not been corrupted. 

LM range error 

The specified value(s) for the language model probability are not valid. Check the input 
files are correct. 
HNet
+8220 

-8221 
+8230 

+8231 
+8232 

+8250 

+8251 
+8252 
+8253 

HRec 
±8520 

+8521 

+8522 
±8570 
+8571 

HLat 
8621 



No such word 

The specified word does not exist or does not have a valid pronunciation. 
Duplicate pronunciations removed 

During network generations duplicate identical pronunciations of the same word are 
removed. 

Contexts not consistent 

HNet can only deal with the standard HTK method for specifying context left-phone+right
and will only allow context free phones if they are context independent and only form
part of the word. This may be indicative of an inconsistency between the symbols in
the dictionary and the HMMs as defined. There may be a model/phone in the dictionary
that has not been defined in the HMM list or may not have a corresponding model. See
also section 12.8 on context expansion.
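
When tracking down such an inconsistency it can help to split model names into their parts. The sketch below is an illustration only; the type and function names are hypothetical and it is not how HNet itself parses names. It assumes the left-phone+right convention described above and that phone names contain neither `-` nor `+`.

```python
from typing import NamedTuple, Optional


class Context(NamedTuple):
    left: Optional[str]   # left context, or None if absent
    phone: str            # base phone
    right: Optional[str]  # right context, or None if absent


def split_context(name: str) -> Context:
    """Split 'l-p+r', 'l-p', 'p+r' or 'p' into its parts (hypothetical helper)."""
    left, dash, rest = name.partition("-")
    if not dash:
        # No '-' present, so there is no left context.
        left, rest = None, name
    phone, plus, right = rest.partition("+")
    if not plus:
        # No '+' present, so there is no right context.
        right = None
    if not phone or left == "" or right == "":
        raise ValueError(f"badly formed model name: {name!r}")
    return Context(left, phone, right)
```

For instance, `split_context("s-ih+t")` gives a left context of `s`, a base phone of `ih` and a right context of `t`, while `split_context("sil")` has no context on either side. Collecting the base phones from the dictionary and comparing them with those in the HMM list is one way to spot a missing model.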

No such model 

A particular model could not be found. Make sure that the network is being expanded in
the correct fashion and then ensure that your HMM list will cover all required contexts.

Lattice badly formed

Could not convert lattice to network. The lattice should have a single well defined start
and a single well defined end. When cross word expansion is being performed the number
of !NULL words that can be concatenated in a string is limited.

Lattice format error 

The lattice file is formatted incorrectly. Ensure that the lattice is of the format described
in chapter 20.

Lattice file data error

The value specified in the lattice file is invalid.

Lattice file with multiple start/end nodes

A lattice should have only one well defined start node and one well defined end node.

Lattice with invalid sub lattices

The sub lattices referred to by the main lattice are malformed.



Invalid HMM

One of the HMMs in the network is invalid. Check that the HMMSet has been correctly
initialised.

Network structure invalid

The network is incorrectly structured. Take care to avoid loops that can be traversed
without consuming observations (this may occur if you introduce any 'tee' words in
which all the models making up that word contain tee-transitions). Also ensure that the
recogniser and the network have been created and initialised correctly.

Lattice structure invalid

The lattice was incorrectly formed. Ensure that the lattice was created properly.

Recogniser not initialised correctly

Ensure the recogniser is initialised and used correctly.

Data does not match HMMs 

The observation does not match the HMM structure. Check the parameter kind of the 
data and ensure that the data is matched to the HMMs. 



Lattice incompatible with dictionary 

The lattice refers to a pronunciation variant (field v=) that doesn't exist in the current
dictionary. 
±8622

8623 
8624 

-8630 
8631 

8632 
8690 
8691 

HGraf 
+6870 

LCMap 
+15050 

+15051 

+15052 

+15053 

+15054 

+15055 



+15056 
+15057 
+15058 



Lattice structure invalid 

The lattice does not meet the requirements for some operation. All lattices must have 
unique start and end nodes and for many operations the lattices need to be acyclic (i.e. 
be a Directed Acyclic Graph).
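
These structural requirements can be tested on a lattice that has been reduced to a list of arcs. The sketch below is not HLat code: the function name and the representation (nodes numbered from 0, arcs as start/end pairs) are assumptions made for illustration. It counts nodes without incoming or outgoing arcs and uses Kahn's topological sort to detect cycles.

```python
from collections import defaultdict, deque


def check_lattice(num_nodes, arcs):
    """Return a list of structural problems; an empty list means none were found.

    `arcs` is a list of (start_node, end_node) index pairs (assumed layout).
    """
    successors = defaultdict(list)
    indegree = [0] * num_nodes
    outdegree = [0] * num_nodes
    for start, end in arcs:
        successors[start].append(end)
        outdegree[start] += 1
        indegree[end] += 1

    problems = []
    starts = [n for n in range(num_nodes) if indegree[n] == 0]
    ends = [n for n in range(num_nodes) if outdegree[n] == 0]
    if len(starts) != 1:
        problems.append(f"expected exactly one start node, found {len(starts)}")
    if len(ends) != 1:
        problems.append(f"expected exactly one end node, found {len(ends)}")

    # Kahn's algorithm: repeatedly remove nodes whose predecessors are all done.
    remaining = indegree[:]
    queue = deque(starts)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for nxt in successors[node]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                queue.append(nxt)
    if visited != num_nodes:
        problems.append("lattice contains a cycle")
    return problems
```

A diamond shaped lattice such as `check_lattice(4, [(0, 1), (0, 2), (1, 3), (2, 3)])` reports no problems, whereas adding an arc that leads back to an earlier node is reported as a cycle.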

Start or end word not found 

The specified lattice start or end word could not be found in the dictionary. 

Lattice end node label invalid

The lattice end node must either be labelled with !NULL or the specified end word
(default: !SENT_END).

LLF file not found

The specified LLF file could not be found or isn't in the right format.

Lattice not found in LLF file

A lattice couldn't be found in the LLF file. Note that the order in the LLF file must
correspond to the order of processing.

Lattice not found

The specified lattice file could not be opened.

Lattice operation not supported

The requested operation is not supported, yet.

Lattice processing sanity check failed

During processing an internal sanity check failed. This should never happen.



X11 error

Ensure that the DISPLAY variable is set and that the X11 window system is configured
correctly.



Unlikely num map entries [n] in XYZ

A negative or infeasibly large number of class map entries have been specified.

ReadMapHeader: UNKxxx configs must be set for hdrless map

There is no header on the map so you must set UNKNOWNID and UNKNOWNNAME.

No name in XYZ

No NAME header in class map.

Unknown escmode XYZ in XYZ

ESCMODE header must specify either HTK or RAW.

Class name XYZ duplicate in XYZ

Two classes in the class map have the same name, which is not allowed.

Bad index n for class XYZ in XYZ

A class index less than 1 or greater than or equal to BASEWORDNDX (defined at
compile time in LWMap - default is 65536) was found in the class map. If you need
more than BASEWORDNDX classes then you must recompile HTK with a new base
word value.

Number of entries = n for class XYZ in XYZ

There must be at least one member in each class - empty classes are not allowed.

Bad type XYZ for class XYZ in XYZ 

Classes must be defined using either IN or NOTIN. 

A class is in its own exclusive list. This typically happens when a class map is specified 
as a plain list of words. Such list is by default assumed to be a list of words excluded 
from class !!UNK. The error is triggered when !!UNK is in the word list. !!UNK must be 
removed from the list. 



LWMap 
+15150
+15151 
+15152 
+15153 
+15154 
+15155 

LUtil 
+15250 

LGBase 
+15330 

+15340 

+15341 

+15345 

+15350 

LModel

+15420 

+15430 

+15440 
+15445 

-15450 
+15450 

-15451 
-15460 



Word list /word map file format error 

Check that the word list/word map file is correctly formatted. 
Unlikely num map entries [n] in XYZ 

A negative or infeasibly large number of word map entries have been specified. 

No NAME header in XYZ 

No NAME header in word map. 

No SEQNO header in XYZ 

No SEQNO header in word map. 

Unknown escmode XYZ in XYZ

ESCMODE header must specify either HTK or RAW.

Word name duplicated in XYZ

There are duplicate words in the word map, which is not allowed.

Header format error

Ensure that word maps and/or n-gram files used by the program start with the
appropriate header.

n-gram file consistency check failure

The n-gram file is incompatible with other resources used by the program.

File XYZ is n-gram but inset is n-gram

The specified input gram file is not of the expected gram size.

Requested N[n] greater than gram size [n]

An n-gram was requested which was larger than any of those supplied in the input files.

n-grams out of order

The input gram file is not correctly sorted.

n-gram file format error

Ensure that n-gram files used by the program are formatted correctly and start with the
appropriate header.

Cannot find n-gram component

The internal structure of the language model is corrupted. This error is usually caused
when an n-gram (a, b, c) is encountered without the presence of n-gram (a, b).

Incompatible probability kind in conversion

The currently used language model does not allow the required conversion operation.
This error is caused by attempting to prune a model stored in the ultra file format.

Cannot prune models in ultra format

Pruning of language models stored in ultra file format is not supported.

Word ID size error

Language models with vocabularies of over 65,536 words require the use of larger word
identifiers. This is a sanity check error.

Word XYZ not in unigrams - skipping n-gram. 

There should be a unigram count for each word in other length grams. 
Language model file format error 

The language model file is formatted incorrectly. Check the file is complete and has not 
been corrupted. 

Extraneous line warning 

Extra lines were found on the end of a file and are being ignored. 
Model order reduced 

Due to the effects of pruning the model order is automatically reduced. 
LPCalc
+15520 

+15525 

-15540 
+15540 

LPMerge 

+15620 

LPlex 
+16620 

+16625 

+16630 
-16635 
-16640 
-16645 
+16650 

HLMCopy

+16920 

+16930 

Cluster 
+17050 



Unable to find FLEntry to attach 

Indicates that the LM data structures are corrupt. This is normally caused by NGram 
files which have not been sorted. 

Attempt to overwrite entries when attaching 

Indicates that the LM structure is corrupt. Ensure that the word map file used is suitable 
for decoding the NGram database files. 

n-gram cutoff out of range

An inapplicable cutoff was ignored.

Pruning error

The pruning parameters specified are not compatible with the parameters of the language
model.

Unable to find word in any model

Indicates that the target model vocabulary contains a word which cannot be found in
any of the source models.

symbol XYZ not in word list

The sentence start symbol, sentence end symbol and OOV symbol (only if OOVs are to
be included in the perplexity calculation) must be in the language model's vocabulary.
Note that the vocabulary list is either specified with the -w option or is implicitly derived
from the language model.

Unable to find word XYZ in any model \^ 

Ensure that all words in the vocabulary specified with the -w option are present in 
at least one of the language models. 

Maximum number of unique OOVs reached 

Too many OOVs encountered in the input text.

Transcription file fn is empty

The label file does not contain any words.

Word too long, will be split: XYZ

The word read from the input stream is of over 200 characters.

Text buffer size exceeded (n)

The maximum number of words allowed in a single utterance has been reached.

Maximum utterance length in a label file exceeded (limit is compiled to be n tokens)

No label file utterance end has been encountered within n tokens - perhaps this is a text
file and you forgot to pass the -t option?

Maximum number of phones reached

When HLMCopy is used to copy dictionaries, the target dictionary phone table is
composed by combining the phone tables of all source dictionaries. Check that the
number of different phones resulting from combining the phone tables of the source
dictionaries does not exceed the internally set limit.

Cannot find definition for word XYZ 

When copying dictionaries, ensure that each word in the vocabulary list occurs in at 
least one source dictionary. 



Word XYZ found in class map but not in word map 

All words in the class map must be found in the word map too. 

-17051 Unknown word token XYZ was explicitly given with -u, but does not occur in the word map

This warning appears if you specify an unknown word token which is not found in the 
word map. 

+17051 Token not found in word list 

Sentence start, end and unknown (if used) tokens must be found in the word map. 

+17052 Not all words were assigned to classes 

A classmap was imported which did not include all words in the word map. 

-17053 Word XYZ in word map but not in any gram files

The stated word will remain in whichever class it is already in - either as defaulted to or
supplied via the input class map.
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Chapter 20 



HTK Standard Lattice Format (SLF)

20.1 SLF Files

Lattices in HTK are used for storing multiple hypotheses from the output of a speech recogniser
and for specifying finite state syntax networks for recognition. The HTK standard lattice format
(SLF) is designed to be extensible and to be able to serve a variety of purposes. However, in order
to facilitate the transfer of lattices, it incorporates a core set of common features.

An SLF file can contain zero or more sub-lattices followed by a main lattice. Sub-lattices are
used for defining sub-networks prior to their use in subsequent sub-lattices or the main lattice.
They are identified by the presence of a SUBLAT field and they are terminated by a single period
on a line by itself. Sub-lattices offer a convenient way to structure finite state grammar networks.
They are never used in the output word lattices generated by a decoder. Some lattice processing
operations like lattice pruning or expansion will destroy the sub-lattice structure, i.e. expand all
sub-lattice references and generate one unstructured lattice.
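
The block structure described above can be illustrated with a minimal sketch. It assumes the whole SLF file has already been read into a string; the function name split_slf_blocks and the sample text are hypothetical and are not part of HTK. It relies only on the rule that each sub-lattice is closed by a single period on a line by itself.

```python
def split_slf_blocks(text):
    """Split SLF text into (sub_lattices, main_lattice).

    Each element is a list of non-blank lines. A line holding only a
    period closes the current sub-lattice; whatever follows the last
    period is treated as the main lattice.
    """
    sub_lattices = []
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line == ".":
            sub_lattices.append(current)
            current = []
        else:
            current.append(line)
    return sub_lattices, current


# Hypothetical two-block file: one sub-lattice followed by the main lattice.
sample = (
    "SUBLAT=digits\n"
    "N=2 L=1\n"
    "I=0 W=!NULL\n"
    "I=1 W=!NULL\n"
    "J=0 S=0 E=1\n"
    ".\n"
    "N=2 L=1\n"
    "I=0 W=!NULL\n"
    "I=1 L=digits\n"
    "J=0 S=0 E=1\n"
)
subs, main = split_slf_blocks(sample)
print(len(subs), "sub-lattice(s);", len(main), "lines in main lattice")
```

A real reader would go on to parse the fields of each block; this sketch only separates the blocks.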

A lattice consists of optional header information followed by a sequence of node definitions and
a sequence of link (arc) definitions. Nodes and links are numbered and the first definition line must
give the total number of each.

Each link represents a word instance occurring between two nodes, however for more compact
storage the nodes often hold the word labels since these are frequently common to all words entering
a node (the node effectively represents the end of several word instances). This is also used in lattices
representing word-level networks where each node is a word end, and each arc is a word transition.

Each node may optionally be labelled with a word hypothesis and with a time. Each link has
a start and end node number and may optionally be labelled with a word hypothesis (including
the pronunciation variant, acoustic score and segmentation of the word hypothesis) and a language
model score.

The lattice must have exactly one start node (no incoming arcs) and one end node (no outgoing
arcs). The special word identifier !NULL can be used for the start and end node if necessary.
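
The single start node and single end node requirement is easy to check mechanically. The following is a minimal sketch, assuming the links have already been reduced to (start node, end node) pairs and that nodes are numbered from 0; the function names are hypothetical.

```python
def find_start_and_end_nodes(num_nodes, links):
    """Return (start_nodes, end_nodes) for a lattice.

    links is a list of (start_node, end_node) pairs. A start node has
    no incoming arcs and an end node has no outgoing arcs.
    """
    has_incoming = {end for _, end in links}
    has_outgoing = {start for start, _ in links}
    starts = [n for n in range(num_nodes) if n not in has_incoming]
    ends = [n for n in range(num_nodes) if n not in has_outgoing]
    return starts, ends


def has_single_start_and_end(num_nodes, links):
    """True if the lattice has exactly one start and one end node."""
    starts, ends = find_start_and_end_nodes(num_nodes, links)
    return len(starts) == 1 and len(ends) == 1


# A diamond-shaped lattice: 0 -> {1, 2} -> 3.
diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]
print(find_start_and_end_nodes(4, diamond))
print(has_single_start_and_end(4, diamond))
```

If a network would otherwise have several entry or exit points, adding !NULL nodes that fan out to or collect from them is one way to satisfy the rule.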

20.2 Format

The format is designed to allow optional information that at its most detailed gives full identity,
alignment and score (log likelihood) information at the word and phone level to allow calculation
of the alignment and likelihood of an individual hypothesis. However, without scores or times the 
lattice is just a word graph. The format is designed to be extensible. Further field names can be 
defined to allow arbitrary information to be added to the lattice without making the resulting file 
unreadable by others. 

The lattices are stored in a text file as a series of fields that form two blocks: 

• A header, specifying general information about the lattice. 

• The node and link definitions. 

Either block may contain comment lines, for which the first character is a '#' and the rest of
the line is ignored.

All non-comment lines consist of fields, separated by white space. Fields consist of an alphanumeric
field name, followed by a delimiter (the character '=' or '~') and a (possibly "quoted") field
value. Single character field names are reserved for fields defined in the specification and single
character abbreviations may be used for many of the fields defined below. Field values can be
specified either as normal text (e.g. a=-318.31) or in a binary representation if the '=' character is
replaced by '~'. The binary representation consists of a 4-byte floating point number (IEEE 754) or
a 4-byte integer number stored in big-endian byte order by default (see section 4.9 for a discussion
of different byte-orders in HTK).
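
The binary value encoding can be illustrated with Python's standard struct module. This is a minimal sketch covering only the 4-byte value itself, in the default big-endian order; the helper names are hypothetical, and treating the integer as signed is an assumption, since the text above does not say whether it is signed.

```python
import struct


def encode_float_field(value):
    """Pack a value as a 4-byte IEEE 754 float, big-endian."""
    return struct.pack(">f", value)


def decode_float_field(raw):
    """Unpack a 4-byte big-endian IEEE 754 float."""
    (value,) = struct.unpack(">f", raw)
    return value


def decode_int_field(raw):
    """Unpack a 4-byte big-endian integer (assumed signed)."""
    (value,) = struct.unpack(">i", raw)
    return value


raw = encode_float_field(-318.31)
print(len(raw), decode_float_field(raw))
print(decode_int_field(b"\x00\x00\x00\x2d"))
```

Because a 4-byte float carries only about seven significant decimal digits, a value such as -318.31 is recovered approximately rather than exactly.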

The convention used to define the current field names is that lower case is used for optional
fields and upper case is used for required fields. The meaning of field names can be dependent on
the context in which they appear.

The header must include a field specifying which utterance was used to generate the lattice and
a field specifying the version of the lattice specification used. The header is terminated by a line
which defines the number of nodes and links in the lattice.

The node definitions are optional but if included each node definition consists of a single line
which specifies the node number followed by optional fields that may (for instance) define the time
of the node or the word hypothesis ending at that node.

The link definitions are required and each link definition consists of a single line which specifies
the link number as well as the start and end node numbers that it connects to and optionally other
information about the link such as the word identity and language model score. If word identity
information is not present in node definitions then it must appear in link definitions.
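
Reading a text-format definition line can be sketched as follows. This is a minimal sketch, not the HTK reader: it handles only the '=' delimiter and double-quoted values, ignores escape sequences, and the names parse_fields and line_kind are hypothetical.

```python
import re

# A field is an alphanumeric name, '=', then either a double-quoted
# value or a run of non-whitespace characters.
FIELD_RE = re.compile(r'([A-Za-z0-9]+)=("[^"]*"|\S+)')


def parse_fields(line):
    """Return the fields of one SLF text line as a dict of strings."""
    fields = {}
    for name, value in FIELD_RE.findall(line):
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]  # strip the surrounding quotes
        fields[name] = value
    return fields


def line_kind(line):
    """Classify a line by its first field: 'node', 'link' or 'other'."""
    first = next(iter(parse_fields(line)), None)
    if first == "I":
        return "node"
    if first == "J":
        return "link"
    return "other"


print(parse_fields("J=4 S=2 E=5 a=-1063.12 l=-5.496"))
print(line_kind("I=0 t=0.00 W=!NULL"))
```

Values are left as strings so that the caller can decide which fields to convert to integers or floating point numbers.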



20.3 Syntax 



The following rules define the syntax of an SLF lattice. Any unrecognised fields will be ignored
and no user defined fields may share the first character with pre-defined field names. The syntax
specification below employs the modified BNF notation used in section 7.10. For the node and arc
field names only the abbreviated names are given and only the text format is documented in the
syntax.

latticedef  = latticehead
              lattice { lattice }

latticehead = "VERSION=" number
              "UTTERANCE=" string
              "SUBLAT=" string
              { "vocab=" string | "hmms=" string | "lmname=" string |
                "wdpenalty=" floatnumber | "lmscale=" floatnumber |
                "acscale=" floatnumber | "base=" floatnumber | "tscale=" floatnumber }

lattice     = sizespec
              { node }
              { arc }

sizespec    = "N=" intnumber "L=" intnumber

node        = "I=" intnumber
              { "t=" floatnumber | "W=" string |
                "s=" string | "L=" string | "v=" intnumber }

arc         = "J=" intnumber
              "S=" intnumber
              "E=" intnumber
              { "a=" floatnumber | "l=" floatnumber | "a=" floatnumber | "r=" floatnumber |
                "W=" string | "v=" intnumber | "d=" segments }

segments    = ":" segment {segment}

segment     = string [ "," floatnumber [ "," floatnumber ]]
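
The segments rule can be illustrated with a short sketch that turns the value of a d= field into (model name, duration, likelihood) triples. It assumes that successive segments are separated by ':' characters, which the grammar above does not state explicitly, so treat the separator as an assumption; the function name parse_segments is hypothetical.

```python
def parse_segments(value):
    """Parse the value of a d= field into a list of triples.

    Each triple is (model_name, duration, likelihood); the duration
    and likelihood are None when they are omitted. Segments are
    assumed to be separated by ':' characters.
    """
    if not value.startswith(":"):
        raise ValueError("segment list must start with ':'")
    segments = []
    for item in value.split(":"):
        if not item:
            continue  # skip the empty piece before the leading ':'
        parts = item.split(",")
        name = parts[0]
        duration = float(parts[1]) if len(parts) > 1 else None
        likelihood = float(parts[2]) if len(parts) > 2 else None
        segments.append((name, duration, likelihood))
    return segments


print(parse_segments(":sil,0.05,-318.31:hh,0.10:ow"))
```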



20.4 Field Types 

The currently defined fields are as follows:- 



Field            abbr  o|c  Description

Header fields
VERSION=%s       V     o    Lattice specification adhered to
UTTERANCE=%s     U     o    Utterance identifier
SUBLAT=%s        S     o    Sub-lattice name
acscale=%f             o    Scaling factor for acoustic likelihoods
tscale=%f              o    Scaling factor for times (default 1.0, i.e. seconds)
base=%f                o    Log base for likelihoods (0.0 not logs, default base e)
lmname=%s              o    Name of language model
lmscale=%f             o    Scaling factor for language model
wdpenalty=%f           o    Word insertion penalty

Lattice size fields
NODES=%d         N     c    Number of nodes in lattice
LINKS=%d         L     c    Number of links in lattice

Node fields
I=%d                        Node identifier. Starts node information
time=%f          t     o    Time from start of utterance (in seconds)
WORD=%s          W     wc   Word (if lattice labels nodes rather than links)
L=%s                   wc   Substitute named sub-lattice for this node
var=%d           v     wo   Pronunciation variant number
s=%s             s     o    Semantic tag

Link fields
J=%d                        Link identifier. Starts link information
START=%d         S     c    Start node number (of the link)
END=%d           E     c    End node number (of the link)
WORD=%s          W     wc   Word (if lattice labels links rather than nodes)
var=%d           v     wo   Pronunciation variant number
div=%s           d     wo   Segmentation (modelname, duration, likelihood) triples
acoustic=%f      a     wo   Acoustic likelihood of link
language=%f      l     o    General language model likelihood of link
r=%f             r     o    Pronunciation probability

Note: The word identity (and associated 'w' fields var, div and acoustic) must
appear on either the link or the end node.

abbr is a possible single character abbreviation for the field name.
o|c indicates whether field is optional or compulsory.

20.5 Example SLF file 



The following is a real lattice (generated by the HTK Switchboard Large Vocabulary System with a
54k dictionary and a word fourgram LM) with word labels occurring on the end nodes of the links.
Note that the !SENT_START and !SENT_END "words" model initial and final silence.

VERSION=1.0
UTTERANCE=s22-0017-A_0017Af-s22_000070_000157.plp
lmname=/home/solveb/hub5/lib/lang/fgintcat_54khub500.txt
lmscale=12.00 wdpenalty=-10.00
vocab=/home/solveb/hub5/lib/dicts/54khub500v3.lvx.dct
N=32 L=45
I=0    t=0.00  W=!NULL
I=1    t=0.05  W=!SENT_START  v=1
I=2    t=0.05  W=!SENT_START  v=1
I=3    t=0.15  W=!SENT_START  v=1
I=4    t=0.15  W=!SENT_START  v=1
I=5    t=0.19  W=HOW          v=1
I=6    t=0.29  W=UM           v=1
I=7    t=0.29  W=M            v=1
I=8    t=0.29  W=HUM          v=1
I=9    t=0.70  W=OH           v=1
I=10   t=0.70  W=O            v=1
I=11   t=0.70  W=KOMO         v=1
I=12   t=0.70  W=COMO         v=1
I=13   t=0.70  W=CUOMO        v=1
I=14   t=0.70  W=HELLO        v=1
I=15   t=0.70  W=OH           v=1
I=16   t=0.70  W=LOW          v=1
I=17   t=0.71  W=HELLO        v=1
I=18   t=0.72  W=HELLO        v=1
I=19   t=0.72  W=HELLO        v=1
I=20   t=0.72  W=HELLO        v=1
I=21   t=0.73  W=CUOMO        v=1
I=22   t=0.73  W=HELLO        v=1
I=23   t=0.77  W=I            v=1
I=24   t=0.78  W=I'M          v=1
I=25   t=0.78  W=TO           v=1
I=26   t=0.78  W=AND          v=1
I=27   t=0.78  W=THERE        v=1
I=28   t=0.79  W=YEAH         v=1
I=29   t=0.80  W=IS           v=1
I=30   t=0.88  W=!SENT_END    v=1
I=31   t=0.88  W=!NULL
J=0    S=0    E=1    a=-318.31   l=0.000
J=1    S=0    E=2    a=-318.31   l=0.000
J=2    S=0    E=3    a=-1094.09  l=0.000
J=3    S=0    E=4    a=-1094.09  l=0.000
J=4    S=2    E=5    a=-1063.12  l=-5.496
J=5    S=3    E=6    a=-1112.78  l=-4.395
J=6    S=4    E=7    a=-1086.84  l=-9.363
J=7    S=2    E=8    a=-1876.61  l=-7.896
J=8    S=6    E=9    a=-2673.27  l=-5.586
J=9    S=7    E=10   a=-2673.27  l=-2.936
J=10   S=1    E=11   a=-4497.15  l=-17.078
J=11   S=1    E=12   a=-4497.15  l=-15.043
J=12   S=1    E=13   a=-4497.15  l=-12.415
J=13   S=2    E=14   a=-4521.94  l=-7.289
J=14   S=8    E=15   a=-2673.27  l=-3.422
J=15   S=5    E=16   a=-3450.71  l=-8.403
J=16   S=2    E=17   a=-4635.08  l=-7.289
J=17   S=2    E=18   a=-4724.45  l=-7.289
J=18   S=2    E=19   a=-4724.45  l=-7.289
J=19   S=2    E=20   a=-4724.45  l=-7.289
J=20   S=1    E=21   a=-4796.74  l=-12.415
J=21   S=2    E=22   a=-4821.53  l=-7.289
J=22   S=18   E=23   a=-435.64   l=-4.488
J=23   S=18   E=24   a=-524.33   l=-3.793



J=24   S=19   E=25   a=-520.16   l=-4.378
J=25   S=20   E=26   a=-521.50   l=-3.435
J=26   S=17   E=27   a=-615.12   l=-4.914
J=27   S=22   E=28   a=-514.04   l=-5.352
J=28   S=21   E=29   a=-559.43   l=-1.876
J=29   S=9    E=30   a=-1394.44  l=-2.261
J=30   S=10   E=30   a=-1394.44  l=-1.687
J=31   S=11   E=30   a=-1394.44  l=-2.563
J=32   S=12   E=30   a=-1394.44  l=-2.352
J=33   S=13   E=30   a=-1394.44  l=-3.285
J=34   S=14   E=30   a=-1394.44  l=-0.436
J=35   S=15   E=30   a=-1394.44  l=-2.069
J=36   S=16   E=30   a=-1394.44  l=-2.391
J=37   S=23   E=30   a=-767.55   l=-4.081
J=38   S=24   E=30   a=-692.95   l=-3.868
J=39   S=25   E=30   a=-692.95   l=-2.553
J=40   S=26   E=30   a=-692.95   l=-3.294
J=41   S=27   E=30   a=-692.95   l=-0.855
J=42   S=28   E=30   a=-623.50   l=-0.762
J=43   S=29   E=30   a=-556.71   l=-3.019
J=44   S=30   E=31   a=0.00      l=0.000
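
To show how the a and l link scores and the lmscale and wdpenalty header fields might be used together, the sketch below finds the best-scoring path through a small hand-made lattice (not the lattice above). The combination rule, acoustic + lmscale x language + wdpenalty per link, is an assumption made for illustration and is not defined in this chapter; the function name best_path and the toy scores are hypothetical.

```python
from functools import lru_cache


def best_path(links, start, end, lmscale=1.0, wdpenalty=0.0):
    """Return (score, path) of the best path from start to end.

    links is a list of (start_node, end_node, acoustic, language)
    tuples holding log scores. The per-link score is ASSUMED to be
    acoustic + lmscale * language + wdpenalty. The lattice must be
    acyclic.
    """
    outgoing = {}
    for s, e, acoustic, language in links:
        score = acoustic + lmscale * language + wdpenalty
        outgoing.setdefault(s, []).append((e, score))

    @lru_cache(maxsize=None)
    def best_from(node):
        if node == end:
            return 0.0, (end,)
        candidates = []
        for nxt, score in outgoing.get(node, []):
            tail_score, tail_path = best_from(nxt)
            candidates.append((score + tail_score, (node,) + tail_path))
        if not candidates:
            return float("-inf"), (node,)  # dead end
        return max(candidates)

    return best_from(start)


# Toy lattice with two competing paths from node 0 to node 3.
toy_links = [
    (0, 1, -10.0, -1.0),
    (0, 2, -12.0, -0.5),
    (1, 3, -5.0, -2.0),
    (2, 3, -1.0, -0.5),
]
score, path = best_path(toy_links, 0, 3, lmscale=2.0, wdpenalty=-1.0)
print(score, path)
```

With these toy values the path through node 2 wins, because its better second link outweighs its poorer first link.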






Index 



ABORTONERR, 53
accumulators, 8, 129
accuracy figure, 41
ACCWINDOW, 68
adaptation, 13, 43

adaptation modes, 136

CMLLR, 138

generating transforms, 44

global transforms, 139

MAP, 43, 143

MLLR, 43, 137

MLLR formulae, 144

regression tree, 44, 139, 157

supervised adaptation, 43, 136

transform model file, 45, 141

unsupervised adaptation, 43, 136
ADDDITHER, 63
ALIEN, 74
all-pole filter, 63
ALLOWCXTEXP, 177
ALLOWXWRDEXP, 41, 177
analysis 

FFT-based, 30 

LPC-based, 30 
ANON, 61 

ARPA-MIT LM format, 224, 225 
AT command, 35, 158 
AU command, 40, 41, 43, 156 
audio output, 75, 76 
audio source, 75 
AUDIOSIG, 76 

average log probability, 187 

back-off bigrams, 171 

ARPA MIT-LL format, 172 
backward probability, 8 
Baum-Welch algorithm, 8
Baum-Welch re-estimation, 6, 7

embedded unit, 127 

isolated unit, 126 
Bayes' Rule, 3 
beam width, 129, 183 
<BeginHMM>, 100 
binary chop, 164 
binary storage, 113, 150 
binning, 65 
Boolean values, 52 
bootstrapping, 10, 17, 120 
byte swapping, 56, 63 
byte-order, 63 



BYTEORDER, 63 

C-heaps, 55 
CEPLIFTER, 64, 66 
cepstral analysis 
filter bank, 65 
liftering coefficient, 64 
LPC based, 64 
power vs magnitude, 65 
cepstral coefficients 

liftering, 64 
cepstral mean normalisation, 66 
CFWORDBOUNDARY, 179
CH command, 94 
check sums, 82 
CHKHMMDEFS, 100 
Choleski decomposition, 101 
CL command, 37, 150 
class id, 220
Class language models, 214, 226
class map

as vocabulary list, 223
complements, 222
defining unknown, 222
header, 221
cloning, 36, 37, 150
Cluster, 235-237
cluster merging, 40
clustering

data-driven, 153
tracing ilvi^ 
tree-based, 
CO command, 41
codebook, 111
codebook exponent,
codebooks, 6
coding, 30
command line 

arguments, 31, 50 
ellipsed arguments, 51
integer argument formats, 50
options, 15, 50 
script files, 31 
compile-time parameters 
INTEGRITY_CHECK, 231
INTERPOLATE_MAX, 231
LMPROB_SHORT, 231
LM_COMPACT, 231
LM_ID_SHORT, 231 
SANITY, 231 
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compression, 82 
configuration files, 15, 51-53 

default, 51 

format, 52 

types, 52 
configuration parameters 

USEINTID, 230 

compile-time, 231 

operating environment, 56, 230 

switching, 130 
configuration variables, 15

display, 53

summary, 320
confusion matrix, 189
context dependent models, l'^^ 
continuous speech recognition ,yraO 
count encoding, 224 
Count-based language models, 2M\^ 
covariance matrix, 98
cross-word network expansion, 180
cross-word triphones, 150

data insufficiency, 38
data preparation, 16, 24
DC command, 94, 177
DE command, 93
decision tree-based clustering, 155



digit recogniser, 169 
direct audio input, 75 

signal control 
keypress, 76 

silence detector 
speech detector, 75 
DISCOUNT, 172 
DISCRETE, 78 
discrete data, 160 
discrete HMM 

output probability scaling, 111
discrete HMMs, 111, 159 
discrete probability, 98, 110 
DISCRETEHS, 106 
DP command, 158 
duration parameters, 98 
duration vector, 115 



decision trees, 38 

loading and storing, 156 

decoder, 182 

alignment mode, 185 
evaluation, 186 
forced alignment, 190 
live input, 191 
N-best, 192 
operation, 182 
organisation, 184 
output formatting, 191 
output MLF, 187 
progress reporting, 186 
recognition mode, 185 
rescoring mode, 186 
results analysis, 187 
trace output, 187 

decompression filter, 56 

defunct mixture components 

defunct mixtures, 157 

deleted interpolation, 164 

deletion errors, 187 

delta coefficients, 68 

DELTAWINDOW, 68 

dictionaries, 165 

dictionary 

construction, 26, 175 
edit commands, 176 
entry, 26 
format, 26 
formats, 175 
output symbols, 175 



EBNF, 19, 169 
edit commands 

single letter, 93 
edit file, 93 

embedded re-estimation, 33 
embedded training, 10, 18, 120, 127
<EndHMM>, 100
energy suppression, 77
COMPRESSFACT, 67
ENORMALISE, 52, 68
environment variables, 56
error message
format, 328
error number

structure of, 328
error numbers

structure of, 53
errors,

full listing, 328
ESCALE, 68
EX command, 190
extended Backus-Naur Form, 169
extended filenames, 51
extensions
mfc, 30
scp, 31
wav, 27



126 



Figure of Merit, 20, 189 
file formats 

ALIEN, 74 



Audio Interchange (AIFF), 73

Esignal, 71, 72 

HTK, 69, 72 

NIST, 72 

NOHEAD, 74 

OGI, 74 

SCRIBE, 73 

Sound Designer (SDES1), 73
Sun audio (SUNAU8), 73 
TIMIT, 72 
WAV, 74 
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1€5 



files 

adding checksums, 82 
compressing, 82 
configuration, 51 
copying, 81 

language models, 224, 226 
listing contents, 79 
network problems, 56 
opening, 56 
script, 50 

VQ codebook, 79
filters, 56
fixed-variance, 125
flat start, 17, 28, 32, 120
float values, 52
FoF file, 306 

counts, 224 

header, 224 
FoF files, 224 
FOM, 20, 189 
FORCECXTEXP, 41, 177
forced alignment, 13, 31, 190 
FORCELEFTBI, 177 
FORCEOUT, 187 
FORCERIGHTBI, 177 
forward probability, 7 
forward-backward 

embedded, 127 

isolated unit, 126 
Forward-Backward algorithm, 7 
frequency-of-frequency, 224 
full rank covariance, 100 



Gaussian mixture, 6 
Gaussian pre-selection, 78 
GCONST value, 116 
generalised triphones, 154 
global.ded, 176
global options, 114 
global options macro, 123 
global speech variance, 119 
gram file 

input, 224 

sequencing, 224 
gram files 

count encoding, 223 

format, 223 

header, 223 
grammar, 169 
grammar scale factor, 41 
grand variance, 152 

Hamming Window, 62 
HAudio, 15
HAUDIO, 74
HBuild, 19, 171, 173, 238-239
HCompV, 17, 32, 119, 125, 240-241
HCONFIG, 51
HCopy, 16, 30, 51, 81, 161, 242-244
HDecode, 19, 248
HDict, 14
HDMan, 19, 26, 176, 245-247
HEAdapt, 13
HERest, 136
headers, 220
HEADERSIZE, 74
HERest, 10, 18, 33, 120, 127, 149, 251-254
HGraf, 15
HHEd, 18, 34, 43, 121, 148, 255-264
HIFREQ, 65
HInit, 7, 9, 17, 50, 119, 265-266
HK command, 163
HLabel, 14
HHEd, 37
HLEd, 16, 29, 36, 93, 190, 267-269
HList, 16, 79, 270
HLM, 14, 171
HLMCopy, 271
LNorm, 316
HLRescore, 272-274
HLStats, 16, 171, 275-276
HMath, 14, 49
HMem, 14, 49, 55

HMM 

binary storage, 38 
build philosophy, 18 
cloning, 36 
definition files, 32
definitions, 4, 97

editor, 18
instance of, 11
parameters, 98
triphones, 36
HMM definition

stream weight, 102
basis-form, 99
binary storage, 113
covariance matrix, 100
formal syntax, 113
global features, 100
global options, 114
global options macro, 101
macro types, 1Q(^ 
macros, 104 
mean vector, 100 
mixture components, 100
multiple data streams, 102
stream weight, 102
symbols in, 99
tied-mixture, 110
transition matrix, 100
HMM lists, 95, 106, 107, 127
HMM name, 100 
HMM refinement, 148 
HMM sets, 106 
types, 106 
HMM tying, 107 
HMMIRest, 277-280 

HMODEL, 14 
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HNet, 13, 14 
HParm

SILENERGY, 75 

SILGLCHCOUNT, 75 

SILMARGIN, 75 

SILSEQCOUNT, 75 

SPCGLCHCOUNT, 75 

SPCSEQCOUNT, 75 

SPEECHTHRESH, 75 
HParm, 15
HParse, 19, 25, 169, 281-284
HParse format, 169

compatibility mode, 171

in V1.5, 171

variables, 170
HQuant, 16, 285-286
HRec, 13, 15, 184 
HRest, 9, 17, 119, 287-288 
HResults, 20, 187, 289-292 
HSGen, 19, 27, 174, 293 
HShell, 14, 49 
HSigP, 14 
HSKind, 106 
HSLab, 16, 27, 294-297 
HSmooth, 19, 164, 298-299 
HTrain, 15 
HUtil, 15 
HLRescore, 19 

HVite, 10, 13, 18, 19, 35, 41, 184, 300-302
HVQ, 14 
HWave, 15 
HWAVEFILTER, 73 

insertion errors, 187 
integer values, 52 

Interpolating language models, 213 
<InvCovar>, 100 
isolated word training, 119 
item lists, 37, 151 

indexing, 151 

pattern matching, 151 

JO command, 163 
K-means clustering, 122 

label files, 86 

ESPS format, 88 

HTK format, 87 

SCRIBE format, 88 

TIMIT format, 88 
labels 

changing, 94 

context dependent, 95 

context markers, 88 

deleting, 93 

editing, 93 

external formats, 87 

merging, 94 

moving level, 95 



multiple level, 87 

replacing, 94 

side-by-side, 86 

sorting, 93 
LAdapt, 303-304 
LLink, 313
LNewMap, 315 
language model scaling, 187 
language models 

bigram, 171 
lattice 

comment lines, 348 

field names, 348 

format, 19, 347 

header, 347 

language model scale factor, 193 

link, 347 

N-best, 12 

node, 347 

rescoring, 13 

syntax, 348 
lattice generation, 192 
lattices, 347 
LBuild, 305
LFoF, 224, 306
LGCopy, 307-308
LGList, 309



LGPrep, 310-312
library modules, 14
likelihood computation, 4
linear prediction, 63
cepstra, 64
LINEIN, 75
LINEOUT, 75
live input, 42
<LLTCovar>, 101
LM file formats

ARPA-MIT format, 224
binary, 2^^226 
class, 22773^ 
class countsV^^ 
class probabilities, 227
ultra, 224
LMerge, 314 
LOFREQ, 65 
log arithmetic, 9 
LPC, 64 

LPCEPSTRA, 64 
LPCORDER, 64 
LPlex, 317-318 
LPREFC, 64 

LS command, 44, 156, 157 
LSubset, 319 
LT command, 41, 43, 156 

M-heaps, 55 
macro definition, 104 
macro substitution, 105 
macros, 35, 104 



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special meanings, 105 

types, 105 
marking word boundaries, 150 
master label files, 29, 86, 89, 124, 127 

embedded label definitions, 89 

examples, 91 

multiple search paths, 89 

pattern matching, 90 

patterns, 29, 91 

search, 90 

sub-directory search, 90

syntax, 90

wildcards, 90
master macro file, 109 
master macro files, 32 

input/output, 149 
matrix dimensions, 100 
MAXCLUSTITER, 161 
maximum model limit, 187 
MAXTRYOPEN, 56 
ME command, 94 
<Mean>, 100 
mean vector, 98 
MEASURESIL, 191 
mel scale, 65
HIFREQ, 66
LOFREQ, 66
MELSPEC, 65
WARPFREQ, 65
WARPLCUTOFF, 66
WARPUCUTOFF, 66
memory 

allocators, 55 

element sizes, 55 

statistics, 55 
memory management, 55 
MFCC coefficients, 30, 100 
MICIN, 75 

minimum occupancy, 39 
MINMIX, 157, 163, 164 
<Mixture>, 100, 116 
mixture component, 98 
mixture incrementing, 156 
mixture splitting, 157 
mixture tying, 163 
mixture weight floor, 157 
ML command, 95 
MLF, 29, 86 
MMF, 32, 109 
model compaction, 41 
model training 

clustering, 153 

compacting, 154 

context dependency, 149 

embedded, 127 

embedded subword formulae, 134 
forward/backward formulae, 132 
HMM editing, 149 
in parallel, 129 



initialisation, 121 
isolated unit formulae, 134 
mixture components, 122 
pruning, 128 

re-estimation formulae, 131 

sub-word initialisation, 124

tying, 150 

update control, 124 

Viterbi formulae, 131 

whole word, 123 
monitoring convergence, 124, 128 
monophone HMM 

construction of, 31 
MP command, 177 
MT command, 158 
MU command, 157 
mu law encoded files, 73
multiple alternative transcriptions, 192
multiple hypotheses, 347
multiple recognisers, 184
multiple streams, 76

rules for, 76
multiple-tokens, 13



N-best, 13, 192 
n-gram language model, 305 
N-grams, 171 
NATURALREADORDER, 56
NATURALWRITEORDER, 56
NC command, 153
network type, 178
networks, 165

in recognition, 166
word-internal, 41
new features

in Version 2.1, 22
in Version 3.1, 21
in Version 3.2, 21
in Version 3.3, 21
in Version 3.4, 20
ngram

count encoding, 224
files, 223
NIST, 20
NIST format, 188
NIST scoring software, 188
NIST Sphere data formatCjfe 
non-emitting states, 10 
non-printing chars, 54 
NSAMPLES, 74 
NUMCEPS, 64, 66 
CMEANDIR, 67 
CMEANMASK, 67 
NUMCHANS, 52, 66 
VARSCALEDIR, 67 
VARSCALEFN, 67 
VARSCALEMASK, 67 
<NumMixes>, 100, 112
<NumStates>, 114 
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observations 

displaying structure of, 81 
operating system, 49 
outlier threshold, 39 
output filter, 56 
output lattice format, 193 
output probability 

continuous case, 6, 98 

discrete case, 98 
OUTSILWARN, 191
over-short training segments, 126



parameter estimation, 11 
parameter kind, 69 
parameter tie points, 105 
parameter tying, 37 
parameterisation, 30 
partial results, 187 
path, 10 

as a token, 11 
partial, 10 
Perplexity, 211 
phone alignment, 35 
phone mapping, 35 
phone model initialisation, 120 
phone recognition, 180 
phones, 120 
PHONESOUT, 75
phonetic questions, 155 
pipes, 49, 56 
PLAINHS, 107 
pre-emphasis, 62 
PREEMCOEF, 62 
Problem solving, 217 
prompt script 

generation of, 27
prototype definition, 16 
pruning, 18, 33, 183, 187 

in tied mixtures, 163 
pruning errors, 129 

QS command, 39, 155 
qualifiers, 60, 69 

_A, 68 

_T, 68 

_C, 70, 83 

_D, 68

_E, 68

_K, 70, 71, 83

_N, 69, 77 

_0, 68 

_V, 78, 162 

_V, 82 

_Z, 66 

codes, 70 

ESIG field specifiers, 71 
summary, 84 

RAWENERGY, 68 

RC command, 44, 157 



RE command, 93 
realignment, 35 
recogniser evaluation, 41 
recogniser performance, 188 
recognition 

direct audio input, 42 

errors, 188 

hypothesis, 183 

network, 182 

output, 42 

overall process, 167 

results analysis, 42 

statistics, 188 

tools, 19 
recording speech, 27 
RECOUTPREFIX, 192 
RECOUTSUFFIX, 192 
reflection coefficients, 63
regression formula, 68 
removing outliers, 154 
results analysis, 20 
RN, 158 

RN command, 44 
RO command, 39, 149, 154 
RP command, 177 
RT command, 158 

SAVEASVQ, 78, 162
SAVEBINARY, 113, 150
SAVECOMPRESSED, 82
SAVEWITHCRC, 82
script files, 50, 123
search errors, 184
segmental k-means, 17
sentence generation, 175
sequenced gram files, 224
SH command, 154
SHAREDHS, 1«B) 
short pause, 3i-N 
signals

for recording control, 191
silence floor, 68
silence model, 34, 3^^52 
SILFLOOR, 68
simple differences, 69
SIMPLEDIFFS, 69
single-pass retraining, 13cC^ 
singleton clusters, 154 
SK command, 158 
SLF, 19, 25, 165, 167 

arc probabilities, 169 
format, 167 
null nodes, 168 
word network, 168 
SO command, 93 
software architecture, 14 
SOURCEFORMAT, 71, 72
SOURCEKIND, 60, 191 
SOURCELABEL, 87, 95 
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SDURCERATE, 61, 75 

SP command, 177 
speaker identifier, 189 
SPEAKEROUT, 75
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